
DESIGNING PEPTIDES WITH WEIGHTED FINITE-STATE TRANSDUCERS

CHRISTOPHER WHELAN
OREGON HEALTH & SCIENCE UNIVERSITY

Abstract. The discovery of novel peptides with useful capabilities or characteristics could lead to significant advances in fields such as materials science, nanotechnology, and medicine. However, the large size of the sequence search space, combined with the time required to experimentally test or simulate peptide behavior at the molecular level, makes statistical computational approaches attractive. We present a novel method for designing peptides based on sequence analysis, and apply it to two problems in peptide design: inorganic binding peptides and antimicrobial peptides. Peptides with the ability to bind to inorganic materials have many potential applications including medical devices, nanotechnology, and bone and tooth regeneration. Antimicrobial peptides have attracted attention as a potential source of therapeutic agents due to the rise of microbes resistant to traditional antibiotics. To design these peptides, we train a support vector machine classifier that discriminates between positive and negative sequences based on the counts of n-grams of amino acid chemical classes. Using the model learned by the classifier, we then build weighted finite-state transducers that we can sample or search for novel sequences sharing the characteristics of the positive training examples. We used this framework to produce a set of putative inorganic binding peptides, which we are testing experimentally. We also generated novel antimicrobial peptide sequences and used third-party prediction services to validate them, with strong initial results. We believe that our framework is flexible and generally applicable to many problems in peptide design.

1. INTRODUCTION

1.1. Peptide Design Applications. Designed peptides have potential uses

in a wide variety of applications, from medicine to materials science and nan-

otechnology. However, the design of new peptides is made difficult by the

large search space and incomplete chemical knowledge. Each position in a

peptide sequence can potentially hold any of the 20 standard amino acid

residues. Therefore, the search space grows exponentially as the length of



the sequence increases, making an exhaustive search impossible; for a sequence of length 30, there are $20^{30} \approx 10^{39}$ possible sequences. This

makes it difficult to search for novel peptides with processes that involve ex-

perimental testing or computationally expensive methods such as molecular

dynamics simulations. In addition, it is still impossible to accurately predict

the complete structure of a peptide given its sequence. In applications in

which researchers do not have an experimentally verified three-dimensional

structure of a known peptide to use as a template, statistical and machine

learning approaches could be very useful in the peptide design process. We

have built such a system for designing novel peptides and applied it to two

problems in peptide design: inorganic binding and antimicrobial peptides.
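The search-space figure quoted above can be checked with a few lines of arithmetic. This is only an illustrative sketch; the function name `search_space_size` is our own label and not part of the system described in this paper.

```python
import math

ALPHABET_SIZE = 20  # the 20 standard amino acid residues


def search_space_size(length: int) -> int:
    """Number of distinct peptide sequences of the given length."""
    return ALPHABET_SIZE ** length


size = search_space_size(30)
# Python integers are arbitrary precision, so the count is exact.
print(size)
print("order of magnitude: 10^%d" % math.floor(math.log10(size)))
```

Each additional residue multiplies the number of candidates by 20, which is why exhaustive experimental or simulation-based screening quickly becomes impractical.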

1.2. Inorganic Binding Peptides. Peptides that are capable of binding

inorganic materials such as metal or quartz have potential uses in a large

number of applications. Many organisms found in nature incorporate inorganic materials into tissues such as bone, teeth, and shells. To build

these structures, they use proteins and enzymes that direct their assembly at

a nanoscale level [1]. Understanding and reverse engineering these processes

could lead the way to many advances in engineering and medicine. For ex-

ample, bone and tooth regeneration might be possible if we could recreate

the chemical machinery that directs their formation. Other applications

include the construction of nanoscale electronic and photonic devices. A

first step towards these goals is the discovery and understanding of peptides

capable of binding to inorganic substances. High-throughput combinatorial

techniques such as phage display [2], in which a large number of bacterio-

phages are induced to express mutated peptides on their exteriors and then

screened for ability to bind to a surface, have begun to provide data sets that

can be used for statistical analysis of inorganic binding peptide sequences.




1.3. Antimicrobial Peptides. Antimicrobial peptides (AMPs) have attracted considerable attention as a potential new source of therapeutic agents effective against microorganisms that have developed resistance to traditional drugs [3]. Researchers

have recently published several databases of AMPs, such as APD2 [4] and

CAMP [5], allowing the creation of large data sets for the computational

analysis of AMPs. These data sets can be used to learn patterns in AMP

sequences, which researchers can then use to design novel AMPs. For example, Wang et al. [4] used the most common motifs in the APD2 database to hand-design a peptide that exhibited strong activity against E. coli.

1.4. Overview of the Paper. In Section 2, I give an overview of the nec-

essary components of a computational peptide design system, summarize

previous approaches, and introduce our approach. In Section 3, I discuss

the details of our approach, and our application of it to both the inorganic

binding peptide and antimicrobial peptide design problems. Section 4 con-

tains preliminary results, including an assessment of the performance of the

classifier used in our system, a description of the planned experimental ver-

ification of our generated inorganic binding peptides, and an assessment of

designed AMP sequences using third-party computational prediction servers.

I also examine the features identified by our models as most important for

both design problems. In Section 5, I summarize our contribution and dis-

cuss future improvements.

2. PEPTIDE DESIGN APPROACHES

2.1. Introduction to Peptide Design Solutions. Computational pep-

tide design approaches must have three components: a method for gener-

ating candidate sequences, a feature set to use to characterize sequences,

and a method for scoring candidate sequences. The first component must

produce a set of novel sequences, either randomly or using a probabilistic




or rule-based system based on a model of the desired class of peptides.

The system must then extract a set of features for each sequence for use in

the final component, which is a scoring method that has been trained on

known sequences. The complete system generates novel sequences, extracts

their features, and scores them. Researchers then select the highest scor-

ing candidates for experimental validation. Some approaches then use these

experimental results to iteratively refine either their scoring model or their

sequence generation model.

2.2. Past Approaches. Oren et al. [6] used a computational method to

search for novel peptides that bind to inorganic materials. For candidate

generation, they randomly generated 1,000,000 peptide sequences. Using a

clustering approach, they scored candidates based on their distances from

sets of known strong and weak binders, as measured by global pairwise

alignments. They trained their scorer, in this case the substitution matrix

used in the alignments, using a greedy stochastic search. Because the score

was based on sequence alignment, the feature set used in this approach was

the raw amino acid sequence. The model successfully designed several strong

and weak binding peptides.

In another peptide design approach, a group of researchers at the Uni-

versity of British Columbia have built a scoring method for novel AMPs

using artificial neural networks [7, 8]. The authors created feature sets

using two-dimensional quantitative structure-activity relationship (QSAR)

descriptors. QSAR descriptors quantify the chemical structure of a peptide

based on the properties of amino acids in the sequence. To generate

novel sequences for scoring, the authors sampled probability distributions

that modeled the likelihood of a given amino acid appearing at a particular

location in the sequence.




Loose et al. [9] generated novel AMPs using a linguistically inspired approach. To generate sequences, they built a set of regular grammars based on

a database of known AMPs, and then exhaustively enumerated the language

defined by those grammars. They then clustered candidate sequences and

gave representatives from each cluster a score based on the number of known

AMPs that shared a rule in the grammar with the candidate sequence.

Socolich et al. [10] built a scoring model for WW domains, which are

proteins with a particular type of fold, based on coupling constraints

between positions in sequences aligned using multiple sequence alignment

(MSA). For an example of a coupling constraint, consider an MSA in which

the amino acid in a given position is always polar when the acid in another

fixed position is non-polar. The authors used a simulated annealing pro-

cedure to find sequences which minimized the difference between coupling

constraints in the test set and the novel sequences, and experimentally ver-

ified several generated peptides. Thomas et al. [11] used the same feature

set for this problem, but designed a more complex generation method. The

authors trained a probabilistic graphical model (PGM) to learn the AA cou-

pling constraints, which they encoded as edges in the model. To generate

sequences from their PGM, the authors developed two sampling methods.

They scored their sampled sequences based on their log likelihoods in the

PGM.

The approaches described above are summarized in Table 1. This summary demonstrates that it is possible to characterize peptide design approaches based on their sequence generation method, scoring method, and chosen feature sets. Analyzing approaches in this way may help to differentiate the strengths and shortcomings of various approaches, and encourages a modular view of the problem. It may be possible in the future to combine candidate generation methods with scoring methods based on their results in similar peptide application domains. In the next section, we describe our approach, which uses a feature set and scoring method designed to take advantage of recent machine learning techniques for biological sequences, and incorporates a novel generation method that we believe will be more efficient at suggesting new peptide sequence candidates than the methods described above.

Target                  Generation Method     Scoring Method        Feature Set
Inorganic binding [6]   Random                Alignment score       Sequence
AMPs [7]                AA distribution       ANN                   QSAR
AMPs [8]                AA distribution       ANN                   QSAR
AMPs [9]                Regular grammars      Grammar matching      Sequence
WW domains [10]         Simulated annealing   Residue correlation   MSA coupling
WW domains [11]         Sampling of PGM       PGM log likelihood    MSA coupling

Table 1. Selection of recent approaches for computational peptide design. Abbreviations are AMP: Antimicrobial peptide; AA: amino acid; ANN: Artificial neural networks; QSAR: Quantitative structure-activity relationship; MSA: Multiple Sequence Alignment; PGM: Probabilistic graphical model.

2.3. Overview of Proposed Solution. We propose a method for de novo

peptide design based on learning from n-gram counts of classes of amino

acid residues, and then using weighted finite-state transducers to produce

sequences that include those features that are strongly associated with the

desired class of peptides. Feature mappings based on n-gram counts are

analogous to representations of sequences in the frequency domain [12], and

have been used successfully in tasks such as protein remote homology de-

tection [13], the problem of detecting similarities between proteins from

different organisms. In our application, we attempt to learn a set of weights

that describe how each n-gram feature is associated with the target peptide

class, and then use those weights to generate new sequences. The latter task

is made difficult, however, by the fact that most vectors of n-gram feature

counts do not represent valid peptide sequences. For example, a feature vec-

tor that has a positive count for a trigram feature must also have positive




counts for the bigrams and unigrams contained within the trigram; other-

wise, it cannot represent a valid sequence. If the weight associated with the

trigram feature is high, but the weight associated with the bigram features

is low or negative, one cannot simply increase the count of the trigram fea-

ture while decreasing the count of the bigram features. Weighted finite-state

transducers (WFSTs) are a potential solution for this problem in sequence

design. Commonly used in speech recognition and natural language process-

ing [14, 15], they can be used to build a weighted lattice of sequences that

can be searched or sampled using efficient algorithms to yield sequences rich

in a desired set of n-gram features, while still being valid sequences.

3. METHODS

3.1. Construction of Inorganic Binding Peptides Data Set. In our

initial training phase, we used the 39 training examples from Oren et al.

[6] to train our scoring model. All sequences were of length 12. Oren et

al. characterized these sequences according to their quartz binding affinity

as either strong (10 sequences), moderate (14 sequences), or weak binders (15 sequences). We used the 10 strong binders as positive training examples

and the 29 moderate and weak binders as negative training examples. We

initially trained our model using these 39 examples, and then tested the

scoring model on the 10 novel peptides generated by Oren et al. Because

the data set was small, for the sequence generation phase of our approach

we retrained on the original training set of 39 plus the two strongest and two

weakest of the 10 novel sequences, for a total of 43 sequences: 12 positive

examples and 31 negative examples.

3.2. Construction of Antimicrobial Peptides Data Set. We down-

loaded all experimentally verified AMPs from the CAMP database as of

March 8, 2010, yielding a set of 1,187 peptide sequences. After removing all


Figure 1. An example of how we map peptide sequences to features. The substring DWP contributes to the count of the trigram feature f3, which represents subsequences of classes Acidic, Aromatic, Cyclic.

sequences that contained the nonstandard amino acid letters B, X, and Z,

and then extracting representative sequences using the CD-HIT clustering

server [16] with a sequence identity parameter of 0.9, we were left with a data

set of 862 AMP sequences, with a mean length of 34 amino acid residues

and a median length of 30. It is difficult to create a negative training set

of experimentally verified non-AMPs, so we followed Thomas et al. [5] in

noting that AMPs are generally secreted from cells, and downloaded a set

of human protein sequences from the UniProt database that were between

twenty and fifty amino acids in length, not annotated as antimicrobial, and

not annotated as secreted. This gave us a set of 1,224 negative training

examples. We randomly split the data, putting 70% in a training set and

30% in a test set.

3.3. Extracting Features and Training an SVM Classifier. To build

a feature space and classifier for our peptide sequences, we define a set

of 13 classes to represent the chemical properties of amino acid residues:

acidic, cyclic, aliphatic, aromatic, basic, buried, charged, hydrophobic, large,

medium, small, non-neutral, and polar. We define unigram, bigram, and

trigram features to be the ordered set of classes contained in a subsequence

of one, two, or three amino acids. For each peptide sequence, we count the

number of times each feature appears, as shown in Figure 1.
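The feature mapping described above can be sketched as follows. This is a minimal illustration under two assumptions that are ours, not the paper's: the `CLASSES` table is a hypothetical and deliberately incomplete assignment of residues to classes (chosen only so that D is acidic, W is aromatic, and P is cyclic, as in Figure 1), and a window of residues is taken to contribute to every combination of classes its residues belong to.

```python
from collections import Counter
from itertools import product

# Hypothetical, incomplete class table for illustration only.
CLASSES = {
    "D": {"acidic", "charged", "polar"},
    "W": {"aromatic", "hydrophobic", "large"},
    "P": {"cyclic", "small"},
    "K": {"basic", "charged", "polar"},
}


def ngram_class_counts(sequence, max_n=3):
    """Count unigram, bigram, and trigram features over residue classes."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for start in range(len(sequence) - n + 1):
            window = sequence[start:start + n]
            class_sets = [sorted(CLASSES[aa]) for aa in window]
            # Every ordered combination of classes is one feature.
            for combo in product(*class_sets):
                counts[combo] += 1
    return counts


features = ngram_class_counts("DWP")
print(features[("acidic", "aromatic", "cyclic")])
```

Because a residue can belong to several classes, a single trigram of residues increments many class trigram features at once, which is one reason the resulting count vectors are highly constrained.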

Given a set of positive and negative training examples, we compute vectors

of feature counts and train an SVM using the SVMLight package [17] V6.02.


[Figure 2 panels: (a) Feature Machine; (b) Scorer, with arcs such as f1/-4.401e-05, f2/-0.00479, f3/0.00515, f4/-0.00492, f5/0.00492.]

Figure 2. Portions of the finite-state machines that generate new sequences. 2(a): A portion of the finite-state transducer that computes the list of features contained in a sequence, showing paths that can be taken from the state that represents a trigram history of CW. In one path the machine accepts the amino acid V as input, and emits the features f3, f6, and f8, before proceeding to the state that indicates that the history is now WV. On the other path the machine accepts the amino acid R, and emits the features f5, f7, and f9. The symbol $\epsilon$ represents the empty string. 2(b): A finite-state acceptor that assigns scores to features.

Training an SVM produces a linear classifier in the feature space, defined

by

$$w^T \phi(x) + b,$$

where $\phi(x)$ is the mapping of a sequence to the feature space and w is a

weight vector that describes the decision boundary hyperplane in the feature

space. It has been shown that the distance from a point in the feature space

to the separating hyperplane can be mapped by a sigmoid function to an

estimate of the probability that the point will belong to the positive or

negative class [18]. Therefore, from our trained SVM we extract w, which

indicates the direction in the frequency feature space that we hypothesize

contains sequences that are more likely to be positive examples.
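As a rough sketch of the two ideas in this paragraph, the snippet below evaluates the linear decision function on a sparse feature-count vector and maps the result through a sigmoid. The feature names, weights, and the sigmoid parameters `a` and `b` are hypothetical; in practice the sigmoid parameters would be fitted on held-out data, and nothing here reproduces SVMLight itself.

```python
import math


def decision_value(weights, features, bias):
    """Compute w^T phi(x) + b for sparse dict-based vectors."""
    total = sum(weights.get(name, 0.0) * count
                for name, count in features.items())
    return total + bias


def sigmoid_probability(value, a=-1.0, b=0.0):
    """Map a decision value f to 1 / (1 + exp(a * f + b)).

    a and b are placeholder values, not fitted parameters.
    """
    return 1.0 / (1.0 + math.exp(a * value + b))


# Hypothetical weights and feature counts.
w = {"f3": 0.5, "f4": -0.25}
phi_x = {"f3": 2, "f4": 1, "f9": 3}
f = decision_value(w, phi_x, bias=0.1)
print(f, sigmoid_probability(f))
```

With a negative value of `a`, larger decision values map to probabilities closer to 1, which matches the intuition that moving along w makes a sequence look more like a positive example.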

3.4. Building a WFST to Generate New Sequences. A weighted

finite-state transducer is an automaton in which each transition between

states is associated with an input symbol, an output symbol, and a weight.




Formally, WFSTs are defined as 8-tuples $(\Sigma, \Delta, Q, I, F, E, \lambda, \rho)$, where $\Sigma$ is an input alphabet, $\Delta$ is an output alphabet, Q is a finite set of states, I is the set of initial states, F is the set of final states, E is the set of transitions between states, $\lambda$ is a weight function for initial states, and $\rho$ is a weight function for final states. The input and output alphabets are augmented with a symbol $\epsilon$ which represents the empty string, allowing transitions to not input or output a symbol. If no outputs are associated with the transitions then the machine is referred to as a weighted finite-state acceptor (WFSA). Examples of WFSTs and WFSAs can be seen in Figure 2. An important operation on WFSTs is composition, denoted by $\circ$. In composition, the outputs of one WFST are fed to the inputs of a second WFST

or WFSA. Efficient implementations of composition allow the construction

of complex models based on a set of simple machines [19].

We use WFSTs to generate novel peptide sequences that will score well ac-

cording to the weight vector learned by our classifier. To do so, we compose

together three finite-state machines to build a WFST capable of generating

our desired sequences. Our first machine, F, is an unweighted transducer

that maps from a sequence of amino acids to a sequence of tokens representing features, as shown in Figure 2(a). Our second machine, S, is a WFSA that provides a score for a given sequence of features. This machine is built using the weight vector from the SVM classifier (Figure 2(b)). Each arc accepts a single feature token $f_i$ with a weight equal to $\alpha w_i$, where $w_i$ is the value of the classifier weight vector for that feature and $\alpha$ is a parameter set when building the model. After applying a weight pushing algorithm and normalizing, the weights within the machine are treated as log probabilities; therefore, the parameter $\alpha$ can be used to vary the peakedness of the probability distribution over generated sequences because $e^{\alpha w_i}$ is a factor in the probability of a sequence. The third machine, T, is a simple transducer




that accepts and outputs any sequence of length m; this machine constrains

the length of our generated sequences. We use a value for m of 12 for the

inorganic binding peptide problem, since 12 is a standard length for designed

inorganic binding peptides. For AMPs, we set m to 30, the median length

of the AMP positive training set.

We build our final machine by composing the three sub-machines:

$$T \circ F \circ S$$

We then determinize the paths through this transducer and normalize the

scores of each path. This produces a WFST that accepts amino acid sequences of length m with a score given by summing up the individual weights of every feature contained within the sequence times the peakedness parameter $\alpha$. We build our finite-state machines using the open source OpenFST library, version 1.1 [20].

In addition to implementing the algorithms needed to produce the trans-

ducer as described above, OpenFST provides tools that can search for the

highest scoring sequences accepted by the machine, and can sample from

high-scoring sequences probabilistically, by treating the scores of each tran-

sition within the machine as a negative log probability. Random sampling

adds diversity to our results, which is desirable because the highest scoring sequences generated by the model are often permutations of the same

motifs.
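To make the effect of searching and sampling the composed machine concrete, the sketch below reproduces the same distribution by brute force on a toy problem. It is not the WFST construction and does not use OpenFST: it simply enumerates every sequence of length m over a tiny hypothetical alphabet, scores each one by the sum of its feature weights times a peakedness parameter `alpha`, and normalizes. The alphabet, the residue n-gram features, and the weights are all invented for illustration; the WFST approach obtains the best path and samples without enumerating sequences.

```python
import math
import random
from itertools import product

ALPHABET = "ACW"  # hypothetical reduced alphabet, small enough to enumerate
# Hypothetical feature weights standing in for the SVM weight vector.
WEIGHTS = {"W": 0.5, "A": -0.2, "CW": 0.8, "WW": 0.1, "WA": -0.6}


def score(sequence, alpha):
    """alpha times the summed weights of all unigram and bigram features."""
    total = 0.0
    for n in (1, 2):
        for i in range(len(sequence) - n + 1):
            total += WEIGHTS.get(sequence[i:i + n], 0.0)
    return alpha * total


def distribution(m, alpha):
    """Normalized probabilities, treating scores as log probabilities."""
    seqs = ["".join(p) for p in product(ALPHABET, repeat=m)]
    logits = [score(s, alpha) for s in seqs]
    z = sum(math.exp(v) for v in logits)
    return {s: math.exp(v) / z for s, v in zip(seqs, logits)}


def best_sequence(m, alpha):
    """Brute-force analogue of a best-path search."""
    dist = distribution(m, alpha)
    return max(dist, key=dist.get)


def sample(m, alpha, k, seed=0):
    """Brute-force analogue of random path sampling."""
    dist = distribution(m, alpha)
    rng = random.Random(seed)
    return rng.choices(list(dist), weights=list(dist.values()), k=k)


print(best_sequence(3, 1.0), best_sequence(3, -1.0))
print(sample(3, 1.0, k=5))
```

Raising `alpha` concentrates probability mass on the top-scoring sequences, while a negative `alpha` inverts the ranking, which is the behavior exploited later to generate predicted negative examples.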

3.5. Addition of a Language Model for Inorganic Binding Peptides.

Because of the small size of the inorganic binding peptide training set, we

wanted to ensure that our model did not overfit and could generalize to

produce novel chemically and biologically plausible sequences. To do so,

we build a machine L, which models the probabilities of peptide sequences

based on published sequences that are known to have some binding function.




This machine is a finite-state language model for amino acid sequences,

of the type commonly used in speech recognition and language processing

[15]. We create a training set for L by downloading all protein sequences

from the Gene Ontology database [21] that are annotated with the term

"binding". Our current language model training set was downloaded from

the 2010-01-10 seqdb release of the Gene Ontology database, which contains

187,269 sequences that are directly or indirectly associated with the term

"binding". By extracting uniquely named sequences from this release, we

produced a training set of 64,894 unique protein sequences that we used to

train our model. To build the language model we used a set of functions

built on the OpenFST toolkit. The language model was built using an n-

gram order of 4, so that the probability of an amino acid appearing in the

sequence is conditional on the previous three amino acids. We use Witten-

Bell smoothing to estimate the probabilities of sequences of amino acids that

do not appear in the training data. Finally, we rescale the language model

probabilities using an additional parameter $\beta$, by which we multiply each transition weight in L. Much like the peakedness parameter $\alpha$ of the scoring machine defined above, $\beta$ controls

the peakedness of the probability distribution defined by the language

model; we use it to increase or decrease the effect of the language model on

the sequences generated by our system. In this case, we build our complete

machine by composing the four sub-machines:

$$L \circ T \circ F \circ S$$

This produces a weighted finite-state acceptor that accepts sequences of

12 amino acids with a score given by summing up the individual weights

of every feature contained within the sequence times the peakedness parameter $\alpha$. The probability of

generating a sequence is affected by both the score of the sequence in the




SVM classifier and the probability of the sequence according to the language

model.
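The language model component can be illustrated with a small stand-alone n-gram model. The sketch below uses one common interpolated formulation of Witten-Bell smoothing, backing off from the longest history down to a uniform distribution over the alphabet, and scales the negative log probability of a sequence by a strength parameter `beta`. It is an assumption-laden stand-in written in plain Python: the class name, the choice of interpolated rather than backoff smoothing, and the toy training strings are ours, and it does not reproduce the OpenFST-based functions used in this work.

```python
import math
from collections import defaultdict


class WittenBellLM:
    """Interpolated Witten-Bell n-gram model over amino acid letters."""

    def __init__(self, sequences, order, alphabet):
        self.order = order
        self.alphabet = alphabet
        # counts[history][symbol] = times symbol followed history
        self.counts = defaultdict(lambda: defaultdict(int))
        for seq in sequences:
            for i, symbol in enumerate(seq):
                for n in range(min(order - 1, i) + 1):
                    self.counts[seq[i - n:i]][symbol] += 1

    def prob(self, symbol, history):
        """P(symbol | last order-1 letters of history)."""
        context = history[-(self.order - 1):] if self.order > 1 else ""
        return self._prob(symbol, context)

    def _prob(self, symbol, context):
        # Lower-order estimate: shorter context, or uniform at the bottom.
        if context:
            lower = self._prob(symbol, context[1:])
        else:
            lower = 1.0 / len(self.alphabet)
        followers = self.counts.get(context)
        if not followers:
            return lower
        total = sum(followers.values())
        types = len(followers)  # distinct symbols seen after context
        return (followers.get(symbol, 0) + types * lower) / (total + types)

    def weight(self, sequence, beta=1.0):
        """Negative log probability of the sequence, scaled by beta."""
        return sum(-beta * math.log(self.prob(s, sequence[:i]))
                   for i, s in enumerate(sequence))


lm = WittenBellLM(["ACAC", "ACCA"], order=2, alphabet="AC")
print(lm.prob("C", "A"), lm.weight("ACA", beta=0.75))
```

An order of 4 would condition each residue on the previous three, as in the model described above; the toy example uses order 2 only to keep the counts easy to follow.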

3.6. Parametrization of the Model and Generation of New Se-

quences. Because of the different validation methods available for the prob-

lems of inorganic binding peptide and AMP design, we used different ap-

proaches to set parameters and generate sequences for testing. For inorganic

binding peptides, we had a very small training set and had to select a small

number of designed sequences for experimental testing; therefore, we first

chose the parameters for our model based on observations of the sequences

generated by a variety of parameter combinations. We then selected the

highest scoring candidate peptides from a model built with the best combi-

nation of parameters. The existence of third-party computational prediction

servers for AMPs, on the other hand, allowed us to generate and test large

batches of sequences simultaneously, albeit without the certainty of exper-

imental verification. Therefore, we generated large groups of sequences for

several models, as well as groups of control sequences for comparison.

3.6.1. Selection of Inorganic Binding Peptide Sequences. In addition to the

language model described in Section 3.5, we had two additional strategies for

helping our model to generalize even with the small training set of inorganic

binding peptides. First, we chose parameters which maximized the diversity

of sequences generated as well as their scores within the model, so that we

would be more likely to have a successful result among our selected set.

Additionally, we used a distance constraint to filter out sequences that were

too different from the known positive examples.

To choose parameters, we sampled sets of 2000 sequences for each value

of the model peakedness parameter $\alpha$ in [1, 2, 3, 5, 6, 7, 8, 10, 11], and the




language model strength parameter $\beta$ in [0.75, 1, 1.5, 2, 3]. For each combination of parameters, we computed the average score for all 2000 sequences

in the SVM classifier, as well as the average score given by the QUARTZ2

matrix designed by Oren et al. [6]. Greater values of $\alpha$ and $\beta$ produce sequences with higher scores in both scoring systems, as do smaller values of the distance threshold $\delta$ introduced below. However, the diversity of the sequences created decreases in both cases. Based on a subjective analysis of the scores and diversity of the sequences created, we chose a value for the peakedness parameter $\alpha$ of 3 and a value for the language model strength parameter $\beta$ of 0.75; this set of

parameters produced consistently high scoring sequences while maintaining

population diversity in the generated set. POPDIV [22] is an algorithm for

computing the population diversity of a sample of peptide sequences. Our

final parameter choice yielded a set of sequences with high scores but significant population diversity as measured by POPDIV; with larger values of the peakedness parameter $\alpha$, population diversity drops quickly to insignificant levels.

To select our final candidates for testing, we added a final filtering step to

our process that ensured that our candidate sequences were similar to the

positive training examples. We first generated 100,000 random sequences.

For each sequence, we calculated the Euclidean distance between it and the

centroid of the set of positive training examples in the feature space. In

other words, we calculated the distance d for a candidate sequence x as:

$$d(x) = \left\| \phi(x) - \frac{1}{|P|} \sum_{i=1}^{|P|} \phi(P_i) \right\|_2,$$

where P is the set of positive training examples and $\phi$ is the mapping from a sequence to the n-gram feature space. By discarding sequences for which $d(x) > \delta$ for some threshold $\delta$, we constrained our search to find sequences that lie

within a region centered on the average positive training example. Based on

comparisons of the scores and diversities of sequences at various distances,




we chose a value for the distance threshold $\delta$ of 40. We discarded all sequences with distances

greater than 40, and then chose the 50 remaining sequences with the highest

scores in our SVM classifier as candidates for experimental validation. To

further test the properties of our system, we also generated sequences with a

peakedness parameter $\alpha$ of -1. This inverts the weights and should produce peptide sequences

that have little to no inorganic binding affinity. To produce this set of

predicted weak binders, we chose the 20 lowest scoring sequences from a

sample of 100,000. We also included the lowest possible scoring sequence in

the model, found using a best path search of the WFST: AIRGIRGIRGIR.
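The filtering step described in this section (distance to the centroid of the positive examples, followed by selection of the top-scoring survivors) might be sketched as below. Feature vectors are represented as sparse dictionaries of counts; the function names, the toy vectors, and the stand-in scoring function are hypothetical and serve only to illustrate the computation.

```python
import math


def centroid(vectors):
    """Mean of sparse feature-count vectors (dicts); assumes a non-empty list."""
    keys = set().union(*vectors)
    n = len(vectors)
    return {k: sum(v.get(k, 0.0) for v in vectors) / n for k in keys}


def distance_to_centroid(x, center):
    """Euclidean distance between a sparse vector and the centroid."""
    keys = set(x) | set(center)
    return math.sqrt(sum((x.get(k, 0.0) - center.get(k, 0.0)) ** 2
                         for k in keys))


def select_candidates(candidates, positives, score, threshold, top_k):
    """Drop candidates farther than threshold, then keep the top_k by score."""
    center = centroid(positives)
    kept = [c for c in candidates
            if distance_to_centroid(c, center) <= threshold]
    return sorted(kept, key=score, reverse=True)[:top_k]


# Toy data: two positive examples and three candidates.
positives = [{"f1": 2, "f2": 0}, {"f1": 4}]
candidates = [{"f1": 3}, {"f1": 3, "f2": 4}, {"f1": 100}]
toy_score = lambda c: c.get("f1", 0) + c.get("f2", 0)  # stand-in for the SVM
print(select_candidates(candidates, positives, toy_score, threshold=5, top_k=1))
```

In the toy data the third candidate has the highest raw score but lies far from the centroid, so it is discarded before ranking, which mirrors the purpose of the distance constraint.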

3.6.2. Creation of Sets of Antimicrobial Peptide Sequences. We created five

sets of novel AMP sequences for testing, each of which consisted of 2,000

novel peptide sequences of length 30. The first, RAND, was a control set,

consisting of amino acid residues chosen randomly at each position. As a sec-

ond control, we also created a set AADIST, which was generated by setting

each amino acid independently based on the distribution of amino acids in

the CAMP database. We then sampled from our WFST to build three sets

WFST1, WFST2, and WFSTNEG, with the peakedness parameter set to

1, 2, and -1, respectively. The group WFSTNEG exists to demonstrate the

ability of the method to solve the inverse design problem: creating sequences

which are unlikely to be AMPs. For each of our WFST generated groups,

we verified that no sequences shared more than 0.9 sequence identity using

the CD-HIT clustering web service [16].
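The two control sets can be generated with a few lines of code. The sketch below is an illustration only: `random_sequences` draws each residue uniformly, in the spirit of RAND, and `composition_sequences` draws each residue independently from observed amino acid frequencies, in the spirit of AADIST. The function names are ours, and the example frequencies are computed from a placeholder list of strings rather than from the CAMP database.

```python
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_sequences(count, length, rng):
    """Uniformly random residues at every position."""
    return ["".join(rng.choice(AMINO_ACIDS) for _ in range(length))
            for _ in range(count)]


def composition(sequences):
    """Relative frequency of each standard amino acid in the sequences."""
    counts = Counter("".join(sequences))
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def composition_sequences(count, length, freqs, rng):
    """Residues drawn independently according to the given frequencies."""
    residues = list(freqs)
    weights = [freqs[aa] for aa in residues]
    return ["".join(rng.choices(residues, weights=weights, k=length))
            for _ in range(count)]


rng = random.Random(0)
rand_set = random_sequences(3, 30, rng)
# Placeholder "database" used only to obtain example frequencies.
freqs = composition(["KKWLKKLLKK", "GLFKKLLKKW"])
aadist_set = composition_sequences(3, 30, freqs, rng)
print(rand_set[0], aadist_set[0])
```

Neither control captures ordering information, so any advantage of the WFST-generated sets over AADIST suggests that the n-gram features contribute something beyond amino acid composition.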

3.7. Experimental Validation of Inorganic Binding Peptides. From

our final set of candidate inorganic binding peptides, our collaborators in

the GEMSEC group at the University of Washington have selected several

peptides for synthesis and experimental validation. Their choices are based




on factors such as difficulty of synthesis and expected yield, properties of

sequences that could not be learned computationally from the data available.

3.8. Third-party Prediction Servers for AMP Sequences. To eval-

uate the novel AMP sequences produced by our system, we used the pre-

diction server provided by the CAMP database website [5]. Although it is

not a substitute for experimental validation, using a computational predic-

tion technique allows rapid testing of large sets of peptides, and is therefore

useful in validating our approach. The CAMP server classifies peptide se-

quences as AMPs using three methods: support vector machines, random

forests (RF), and discriminant analysis (DA). Thomas et al. showed that

these classifiers perform well on a test data set, with overall accuracies of

91.5%, 93.2%, and 87.5%, respectively. The CAMP predictors use a fea-

ture set composed of a variety of features including amino acid composition,

average hydrophobicity and hydrophilicity of the peptide, transition and

composition of groups based on reduced amino acid alphabets, and di- and

tri-peptide composition based on hydrophobicity. Therefore, there is partial

but not complete overlap with the feature set used in our SVM classifier

and WFSTs. Even though this overlap may make it easier for our method

to produce sequences that CAMP classifies as AMPs, we believe that the

construction of sequences that score highly in a third-party classification

system is a valuable demonstration of our approach.

4. RESULTS

4.1. Performance of SVM Classifiers. We began evaluating our system

by testing the SVM classifiers we trained for each peptide design problem

on a held out test set. Because we used the feature weights from the SVM

to build our WFSTs, our system depends on the SVMs' ability to correctly

classify unseen sequences as positive or negative examples.


[Figure 3 plot. X axis: Distance from hyperplane. Y axis: Dip position shift (nm). Legend: Predicted Strong Binders, Predicted Weak Binders.]

Figure 3. Experimental binding scores of 10 test sequences from Oren et al. [6]. The sequences are arranged by their distance from the hyperplane in the SVM classifier. The Y axis shows the dip position shift after six minutes as measured by Oren et al., a measurement of binding ability. Sequences that Oren et al. predicted to be strong binders are light lines; predicted weak binders are dark.

4.1.1. Inorganic Binding Peptides. When trained on the 39 training exam-

ples from Oren et al. [6], our classifier successfully predicts strong and weak

binders when evaluated against the 10 novel peptides reported in that paper

(Figure 3). In addition, the distance from the hyperplane predicts a portion

of the observed binding ability ($R^2 = 0.52$, p = 0.02). Although this is a

small data set, this strong performance is encouraging.

4.1.2. Antimicrobial Peptides. We evaluated the SVM classifier from the

AMP model against our held-out test set of 259 positive and 367 negative

examples. Our classifier had an area under the ROC curve of 0.931.

This indicates that our feature set of n-gram counts of amino acid classes


[Figure 4 plot. X axis: sequence generation method (RAND, AADIST, WFST1, WFST2, WFSTNEG). Y axis: Number of Positive Predictions (0 to 2000). Legend (Classifier): SVM, RF, DA.]

Figure 4. Number of sequences in a sample of 2000 predicted to be AMPs for different sequence generation methods. A value of 2000 indicates that all of the generated sequences were predicted to be AMPs; a score of 0 indicates that none were predicted to be AMPs. RAND: Sequences generated completely randomly. AADIST: Sequences generated according to the distribution of amino acids in the CAMP database. WFST1: Sequences generated using a WFST with a peakedness parameter of 1. WFST2: Sequences generated using a WFST with a peakedness parameter of 2. WFSTNEG: Sequences generated using a WFST with a peakedness parameter of -1. Results are shown for all three prediction methods available at the CAMP database server: SVM: support vector machine. RF: Random Forest. DA: Discriminant Analysis.

provides sufficient information to train an SVM classifier with strong per-

formance in the AMP problem domain.
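For readers unfamiliar with the metric, the area under the ROC curve equals the probability that a randomly chosen positive example receives a higher classifier score than a randomly chosen negative one. The sketch below computes it directly from that definition. The decision values are hypothetical and the function name roc_auc is an assumption made for illustration; it is not the evaluation code used in this work.

```python
def roc_auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical SVM decision values for a tiny held-out set.
pos = [1.2, 0.8, 0.3, -0.1]
neg = [0.4, -0.5, -0.9]
print(roc_auc(pos, neg))
```

The pairwise loop is quadratic in the number of examples, which is acceptable for a test set of a few hundred sequences; a rank-based formulation would be preferable for much larger sets.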

4.2. Validation of Generated Sequences. Our inorganic binding se-

quences are being tested experimentally by the GEMSEC group at the Uni-

versity of Washington at the time of this writing. However, we have been

able to test our AMP sequences using computational prediction servers.


Feature                         Weight
Tryptophan                       0.027
Buried-Buried-Cyclic             0.022
Buried-Cyclic-Aromatic           0.019
Medium-Aromatic-Aliphatic        0.019
Aliphatic-Medium-Aromatic        0.018
Hydrophobic-Polar-Medium        -0.018
Polar-Small-Buried              -0.020
Hydrophobic-Medium-Aliphatic    -0.021

Table 2. Highest and lowest scoring features in the inorganic binding model

4.2.1. Antimicrobial Peptides. We used the CAMP prediction servers to de-

termine the number of sequences predicted to be AMPs in our control and

test groups. The results are shown in Figure 4. Averaged across the three

prediction methods, only 3.9% of the random control group were predicted

to be AMPs, while 59.6% of the AADIST control set had positive predic-

tions. In the WFST1 group, an average of 84.9% were predicted to be

AMPs. In the group created with a more peaked probability distribution

over sequences, WFST2, an average of 99.9% of the generated sequences

were predicted to be AMPs. Finally, in the WFSTNEG group, in which the

value of the weights was reversed to reward non-AMP features, only 0.48%

of the generated sequences were predicted to be AMPs.
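The effect of the peakedness parameter can be illustrated with a toy distribution. The sketch below assumes that each candidate sequence has a scalar model score and that its sampling probability is proportional to exp(peakedness × score). This scaling is an assumption made for illustration; it is not a statement of how the parameter enters the WFST weights. The candidate labels and scores are made up.

```python
import math


def sequence_distribution(scores, peakedness):
    """Softmax over scores scaled by the peakedness parameter (assumed form)."""
    scaled = {name: peakedness * value for name, value in scores.items()}
    largest = max(scaled.values())  # subtract the max for numerical stability
    exps = {name: math.exp(value - largest) for name, value in scaled.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}


def entropy(dist):
    """Shannon entropy in bits; lower entropy means less diverse samples."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


# Hypothetical candidates with made-up model scores.
scores = {"cand_a": 2.0, "cand_b": 1.0, "cand_c": 0.0, "cand_d": -1.0}
for peakedness in (1, 2, -1):
    dist = sequence_distribution(scores, peakedness)
    best = max(dist, key=dist.get)
    print(peakedness, best, round(entropy(dist), 3))
```

Under this assumption, raising the parameter from 1 to 2 concentrates probability on the highest-scoring candidate and lowers the entropy, while a value of -1 makes the lowest-scoring candidate the most probable, mirroring the WFST2 and WFSTNEG conditions.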

4.3. Examining Highly Weighted Features in the Models. To better

understand the predictions made by our model, we extracted the features

with the highest and lowest weights for both peptide design domains.
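Extracting the extreme features from a linear model amounts to sorting the weight vector. A minimal sketch follows, assuming the weights are available as a mapping from feature name to value. The named entries reuse the AMP weights reported in Table 3; the two entries marked as placeholders are invented so that the selection has something to discard, and the function name extreme_features is hypothetical.

```python
def extreme_features(weights, n_top, n_bottom):
    """Return the n_top highest and n_bottom lowest weighted features."""
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n_top], ranked[-n_bottom:]


weights = {
    "Cysteine": 0.428,
    "Isoleucine": 0.279,
    "Lysine": 0.246,
    "Aromatic-Buried-Small": 0.169,
    "Medium-Medium-Aromatic": 0.159,
    "placeholder-feature-1": 0.010,   # invented, for illustration only
    "placeholder-feature-2": -0.020,  # invented, for illustration only
    "Threonine": -0.323,
    "Leucine": -0.356,
    "Serine": -0.380,
}

top, bottom = extreme_features(weights, n_top=5, n_bottom=3)
for name, weight in top + bottom:
    print(f"{name:28s} {weight:+.3f}")
```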

4.3.1. Inorganic Binding Peptides. Table 2 shows the five highest and three

lowest weighted features in the inorganic binding peptide model. The impor-

tance of tryptophan in our model agrees with the results of Oren et al. [6].

An analysis of the trigrams with strong weights in the model may help us to

better understand the chemical structures necessary for inorganic binding.


Feature                         Weight
Cysteine                         0.428
Isoleucine                       0.279
Lysine                           0.246
Aromatic-Buried-Small            0.169
Medium-Medium-Aromatic           0.159
Threonine                       -0.323
Leucine                         -0.356
Serine                          -0.380

Table 3. Highest and lowest scoring features in the AMP model

4.3.2. Antimicrobial Peptides. The five highest and three lowest weighted

features for the AMP model are shown in Table 3. As with the inor-

ganic binding peptides, our model identified unigram features of amino acid

residues that are overrepresented in AMP sequences; for example, the im-

portance of cysteine residues in AMPs has been widely studied [23, 24].

Analysis of the trigram and bigram features identified by the model may

yield insights into AMP structure and design.

5. DISCUSSION

5.1. Conclusions. We have shown that by using the n-gram features of

chemical classes of amino acid residues and a trained SVM classifier, we can

produce WFSTs that are capable of generating novel sequences which share

the same features as the training set. We have applied our solution to two

problems in peptide design. The SVM classifier component of our system

performs well on known data sets of positive and negative examples in both

problem domains. We have produced a set of inorganic binding peptide se-

quences that we are testing experimentally. We have also generated sets of

novel AMP sequences, and a third-party classification server predicts that

a large proportion of these novel sequences will have antimicrobial proper-

ties. By varying the parameters used to construct our machines, we can


exchange diversity of the generated sequences for a higher likelihood of gen-

erating novel peptides in a desired class. We believe that this framework is

a promising approach for novel peptide design. By applying our system to

both inorganic binding and AMP sequence design, we have shown that it

may be generalizable to many problems in peptide design.

5.2. Future Work. Although we believe that the use of third-party com-

putational prediction servers for AMPs shows that this is a promising ap-

proach, any attempt to computationally design novel AMPs must eventually

be validated by synthesizing and testing actual peptides. We plan to do

so in future work.

We believe that this framework is generally applicable to many problems

in peptide design. We originally designed the system for the inorganic bind-

ing peptide design problem, but found that we could easily adapt it to the

design of AMPs with minimal changes. Therefore, we are optimistic that it

could be easily generalized to other applications. One such problem is the

design of peptides capable of binding to larger proteins, such as G protein-coupled

receptors, which is an application with many implications in drug

design.

We would also like to study the impacts of using different feature map-

pings for sequences on the performance of the system. Our feature mapping

is closely related to string kernel methods such as the spectrum kernel [13]

and the mismatch kernel [25], and could easily be extended to incorporate

features of other kernels, like the gappy and wildcard [26]. In fact, because

our approach relates string kernels and weighted finite-state transducers, it

should be possible to incorporate any rational kernel [27], a class of kernels

defined in terms of WFSTs that includes most string kernels currently in

use in computational biology.
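To make the relationship to string kernels concrete, the k-spectrum kernel of [13] maps a sequence to the counts of its contiguous k-mers and takes the inner product of two such count vectors. The sketch below implements that definition over raw residue letters; the features in this work are n-grams over chemical classes rather than raw residues, so this is an analogy and not the feature mapping itself. The two example sequences are arbitrary strings, and the function names are hypothetical.

```python
from collections import Counter


def spectrum_features(seq, k):
    """Count every contiguous k-mer in the sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(a, b, k):
    """Inner product of the k-mer count vectors of two sequences."""
    fa, fb = spectrum_features(a, k), spectrum_features(b, k)
    # Counter returns 0 for k-mers absent from fb, so only shared k-mers add.
    return sum(count * fb[kmer] for kmer, count in fa.items())


# Arbitrary example strings over the amino acid alphabet.
print(spectrum_kernel("KWKLFKK", "KLFKKI", k=2))
```

The mismatch, gappy, and wildcard kernels generalize this idea by letting a k-mer contribute to features that it matches only approximately.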


6. ACKNOWLEDGEMENTS

Kemal Sonmez and Brian Roark provided the initial ideas for this project, and contributed invaluable advice during the course of its

implementation. Kemal Sonmez provided the initial implementation of the peptide

sequence to feature mapping code. Brian Roark provided tools for building

the UniProt language model. I am thankful for the willingness of Mehmet

Sarikaya, Candan Tamerler, and the GEMSEC group at the University of

Washington to test the novel inorganic binding sequences. I also thank the

NSF for supporting part of this work through Grant #IIS-0811745.

References

[1] C. Tamerler and M. Sarikaya, "Molecular biomimetics: nanotechnology and bionanotechnology using genetically engineered peptides," Philos Transact A Math Phys Eng Sci, vol. 367, pp. 1705-26, May 2009.

[2] R. H. Hoess, "Protein design and phage display," Chem Rev, vol. 101, pp. 3205-18, Oct 2001.

[3] H. Jenssen, P. Hamill, and R. E. W. Hancock, "Peptide antimicrobial agents," Clin Microbiol Rev, vol. 19, pp. 491-511, Jul 2006.

[4] G. Wang, X. Li, and Z. Wang, "APD2: the updated antimicrobial peptide database and its application in peptide design," Nucleic Acids Research, vol. 37, no. Database issue, p. D933, 2009.

[5] S. Thomas, S. Karnik, R. S. Barai, V. K. Jayaraman, and S. Idicula-Thomas, "CAMP: a useful resource for research on antimicrobial peptides," Nucleic Acids Research, vol. 38, pp. D774-80, Jan 2010.

[6] E. E. Oren, C. Tamerler, D. Sahin, M. Hnilova, U. O. S. Seker, M. Sarikaya, and R. Samudrala, "A novel knowledge-based approach to design inorganic-binding peptides," Bioinformatics, vol. 23, pp. 2816-22, Nov 2007.

[7] A. Cherkasov, K. Hilpert, H. Jenssen, C. Fjell, M. Waldbrook, S. Mullaly, R. Volkmer, and R. Hancock, "Use of artificial intelligence in the design of small peptide antibiotics effective against a broad spectrum of highly antibiotic-resistant superbugs," ACS Chem Biol, vol. 4, no. 1, pp. 65-74, 2008.

[8] H. Jenssen, C. D. Fjell, A. Cherkasov, and R. E. W. Hancock, "QSAR modeling and computer-aided design of antimicrobial peptides," J Pept Sci, vol. 14, pp. 110-4, Jan 2008.

[9] C. Loose, K. Jensen, I. Rigoutsos, and G. Stephanopoulos, "A linguistic model for the rational design of antimicrobial peptides," Nature, vol. 443, pp. 867-9, Oct 2006.

[10] M. Socolich, S. Lockless, W. Russ, and H. Lee, "Evolutionary information for specifying a protein fold," Nature, vol. 437, pp. 512-8, Jan 2005.

[11] J. Thomas, N. Ramakrishnan, and C. Bailey-Kellogg, "Protein design by sampling an undirected graphical model of residue constraints," IEEE/ACM Trans Comput Biol Bioinform, vol. 6, pp. 506-16, Jan 2009.

[12] D. Anastassiou, "Frequency-domain analysis of biomolecular sequences," Bioinformatics, vol. 16, pp. 1073-81, Dec 2000.


[13] C. Leslie, E. Eskin, and W. S. Noble, "The spectrum kernel: a string kernel for SVM protein classification," Pac Symp Biocomput, pp. 564-75, Jan 2002.

[14] B. Roark and R. Sproat, Computational Approaches to Morphology and Syntax. Oxford Surveys in Syntax & Morphology, Oxford: Oxford University Press, 2007.

[15] M. Mohri, F. Pereira, and M. Riley, "Weighted finite-state transducers in speech recognition," Comput Speech Lang, vol. 16, pp. 69-88, Jan 2002.

[16] Y. Huang, B. Niu, Y. Gao, L. Fu, and W. Li, "CD-HIT suite: a web server for clustering and comparing biological sequences," Bioinformatics, vol. 26, pp. 680-2, Mar 2010.

[17] T. Joachims, "Making large-scale SVM learning practical," in Advances in Kernel Methods - Support Vector Learning (B. Schölkopf, C. Burges, and A. Smola, eds.), ch. 11, Cambridge, MA: MIT Press, 1999.

[18] J. C. Platt, "Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods," in Advances in Large Margin Classifiers, pp. 61-74, MIT Press, 1999.

[19] F. C. N. Pereira and M. Riley, Finite-State Devices for Natural Language Processing. Cambridge, Massachusetts: MIT Press, 1997.

[20] C. Allauzen, M. Riley, J. Schalkwyk, W. Skut, and M. Mohri, "OpenFst: A general and efficient weighted finite-state transducer library," in Proceedings of the Ninth International Conference on Implementation and Application of Automata, (CIAA 2007), vol. 4783 of Lecture Notes in Computer Science, pp. 11-23, Springer, 2007. http://www.openfst.org.

[21] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald, G. M. Rubin, and G. Sherlock, "Gene ontology: tool for the unification of biology. The Gene Ontology Consortium," Nat Genet, vol. 25, pp. 25-9, May 2000.

[22] L. Makowski and A. Soares, "Estimating the diversity of peptide populations from limited sequence data," Bioinformatics, vol. 19, pp. 483-9, Mar 2003.

[23] J. Dimarcq, P. Bulet, C. Hetru, and J. Hoffmann, "Cysteine-rich antimicrobial peptides in invertebrates," Peptide Science, vol. 47, no. 6, pp. 465-477, 1999.

[24] T. Ganz, "Defensins: antimicrobial peptides of innate immunity," Nat Rev Immunol, vol. 3, pp. 710-20, Sep 2003.

[25] C. Leslie, E. Eskin, J. Weston, and W. Noble, "Mismatch string kernels for SVM protein classification," Advances in Neural Information Processing Systems, pp. 1441-1448, 2003.

[26] C. Leslie and R. Kuang, "Fast string kernels using inexact matching for protein sequences," The Journal of Machine Learning Research, vol. 5, Dec 2004.

[27] C. Cortes, P. Haffner, and M. Mohri, "Rational kernels: Theory and algorithms," The Journal of Machine Learning Research, vol. 5, pp. 1035-1062, Aug 2004.