whelanch_rpe

download whelanch_rpe

of 23

Transcript of whelanch_rpe

  • 7/27/2019 whelanch_rpe

    1/23

    DESIGNING PEPTIDES WITH WEIGHTED FINITE-STATETRANSDUCERS

    CHRISTOPHER WHELANOREGON HEALTH & SCIENCE UNIVERSITY

    Abstract. The discovery of novel peptides with useful capabilities orcharacteristics could lead to significant advances in fields such as ma-terials science, nanotechnology, and medicine. However, the large sizeof the sequence search space, combined with the time required to ex-perimentally test or simulate peptide behavior at the molecular level,makes statistical computational approaches attractive. We present a

    novel method for designing peptides based on sequence analysis, and ap-ply it to two problems in peptide design: inorganic binding peptides andantimicrobial peptides. Peptides with the ability to bind to inorganicmaterials have many potential applications including medical devices,nanotechnology, and bone and tooth regeneration. Antimicrobial pep-tides have attracted attention as a potential source of therapeutic agentsdue to the rise of microbes resistant to traditional antibiotics. To designthese peptides, we train a support vector machine classifier that discrim-inates between positive and negative sequences based on the counts ofn-grams of amino acid chemical classes. Using the model learned by theclassifier, we then build weighted finite-state transducers that we cansample or search for novel sequences sharing the characteristics of thepositive training examples. We used this framework to produce a setof putative inorganic binding peptides, which we are testing experimen-

    tally. We also generated novel antimicrobial peptide sequences and usedthird-party prediction services to validate them, with strong initial re-sults. We believe that our framework is flexible and generally applicableto many problems in peptide design.

    1. INTRODUCTION

    1.1. Peptide Design Applications. Designed peptides have potential uses

    in a wide variety of applications, from medicine to materials science and nan-

    otechnology. However, the design of new peptides is made difficult by the

    large search space and incomplete chemical knowledge. Each position in a

    peptide sequence can potentially hold any of the 20 standard amino acid

    residues. Therefore, the search space grows exponentially as the length of1

  • 7/27/2019 whelanch_rpe

    2/23

    DESIGNING PEPTIDES WITH WEIGHTED FINITE-STATE TRANSDUCERS 2

    the sequence increases, making an exhaustive search impossible; for a se-

    quence of length 30, there are are 2030 1039 possible sequences. This

    makes it difficult to search for novel peptides with processes that involve ex-

    perimental testing or computationally expensive methods such as molecular

    dynamics simulations. In addition, it is still impossible to accurately predict

    the complete structure of a peptide given its sequence. In applications in

    which researchers do not have an experimentally verified three dimensional

    structure of a known peptide to use as a template, statistical and machine

    learning approaches could be very useful in the peptide design process. We

    have built such a system for designing novel peptides and applied it to two

    problems in peptide design: inorganic binding and antimicrobial peptides.

    1.2. Inorganic Binding Peptides. Peptides that are capable of binding

    inorganic materials such as metal or quartz have potential uses in a large

    number of applications. Many organisms found in nature incorporate inor-

    ganic materials into tissues such as such as bone, teeth, and shells. To build

    these structures, they use proteins and enzymes that direct their assembly at

    a nanoscale level [1]. Understanding and reverse engineering these processes

    could lead the way to many advances in engineering and medicine. For ex-

    ample, bone and tooth regeneration might be possible if we could recreate

    the chemical machinery that directs their formation. Other applications

    include the construction of nanoscale electronic and photonic devices. A

    first step towards these goals is the discovery and understanding of peptides

    capable of binding to inorganic substances. High throughput combinatorial

    techniques such as phage display [2], in which a large number of bacterio-

    phages are induced to express mutated peptides on their exteriors and then

    screened for ability to bind to a surface, have begun to provide data sets that

    can be used for statistical analysis of inorganic binding peptide sequences.

  • 7/27/2019 whelanch_rpe

    3/23

    DESIGNING PEPTIDES WITH WEIGHTED FINITE-STATE TRANSDUCERS 3

    1.3. Antimicrobial Peptides. AMPs have attracted considerable atten-

    tion as a potential new source of therapeutic agents effective against microor-

    ganisms that have developed resistance to traditional drugs [3]. Researchers

    have recently published several databases of AMPs, such as ADP2 [4] and

    CAMP [5], allowing the creation of large data sets for the computational

    analysis of AMPs. These data sets can be used to learn patterns in AMP

    sequences, which researchers can then use to design novel AMPs. For ex-

    ample, Wang et al. [4] used the most common motifs in the ADP2 database

    to hand-design a peptide that exhibited strong activity against E. Coli.

    1.4. Overview of the Paper. In Section 2, I give an overview of the nec-

    essary components of a computational peptide design system, summarize

    previous approaches, and introduce our approach. In Section 3, I discuss

    the details of our approach, and our application of it to both the inorganic

    binding peptide and antimicrobial peptide design problems. Section 4 con-

    tains preliminary results, including an assessment of the performance of the

    classifier used in our system, a description of the planned experimental ver-

    ification of our generated inorganic binding peptides, and an assessment of

    designed AMP sequences used third-party computational prediction servers.

    I also examine the features identified by our models as most important for

    both design problems. In Section 5, I summarize our contribution and dis-

    cuss future improvements.

    2. PEPTIDE DESIGN APPROACHES

    2.1. Introduction to Peptide Design Solutions. Computational pep-

    tide design approaches must have three components: a method for gener-

    ating candidate sequences, a feature set to use to characterize sequences,

    and a method for scoring candidate sequences. The first component must

    produce a set of novel sequences, either randomly or using a probabilistic

  • 7/27/2019 whelanch_rpe

    4/23

    DESIGNING PEPTIDES WITH WEIGHTED FINITE-STATE TRANSDUCERS 4

    or rule based system based on a model of the desired class of peptides.

    The system must then extract a set of features for each sequence for use in

    the final component, which is a scoring method that has been trained on

    known sequences. The complete system generates novel sequences, extracts

    their features, and scores them. Researchers then select the highest scor-

    ing candidates for experimental validation. Some approaches then use these

    experimental results to iteratively refine either their scoring model or their

    sequence generation model.

    2.2. Past approaches. Oren et al. [6] used a computational method to

    search for novel peptides that bind to inorganic materials. For candidate

    generation, they randomly generated 1,000,000 peptide sequences. Using a

    clustering approach, they scored candidates based on their distances from

    sets of known strong and weak binders, as measured by global pairwise

    alignments. They trained their scorer, in this case the substitution matrix

    used in the alignments, using a greedy stochastic search. Because the score

    was based on sequence alignment, the feature set used in this approach was

    the raw amino acid sequence. The model successfully designed several strong

    and weak binding peptides.

    In another peptide design approach, a group of researchers at the Uni-

    versity of British Columbia have built a scoring method for novel AMPs

    using artificial neural networks [7, 8]. The authors created feature sets

    using two-dimensional quantitative structure-activity relationship (QSAR)

    descriptors. QSAR descriptors quantify the chemical structure of a peptide

    based on the the properties of amino acids in the sequence. To generate

    novel sequences for scoring, the authors sampled probability distributions

    that modeled the likelihood of a given amino acid appearing at a particular

    location in the sequence.

  • 7/27/2019 whelanch_rpe

    5/23

    DESIGNING PEPTIDES WITH WEIGHTED FINITE-STATE TRANSDUCERS 5

    Looseet al. [9] generated novel AMPs using a linguistically inspired ap-

    proach. To generate sequences, they built a set of regular grammars based on

    a database of known AMPs, and then exhaustively enumerated the language

    defined by those grammars. They then clustered candidate sequences and

    gave representatives from each cluster a score based on the number of known

    AMPs that shared a rule in the grammar with the candidate sequence.

    Socolich et al. [10] built a scoring model for WW domains, which are

    proteins with a particular type of fold, based on coupling constraints

    between positions in sequences aligned using multiple sequence alignment

    (MSA). For an example of a coupling constraint, consider an MSA in which

    the amino acid in a given position is always polar when the acid in another

    fixed position is non-polar. The authors used a simulated annealing pro-

    cedure to find sequences which minimized the difference between coupling

    constraints in the test set and the novel sequences, and experimentally ver-

    ified several generated peptides. Thomas et al. [11] used the same feature

    set for this problem, but designed a more complex generation method. The

    authors trained a probabilistic graphical model (PGM) to learn the AA cou-

    pling constraints, which they encoded as edges in the model. To generate

    sequences from their PGM, the authors developed two sampling methods.

    They scored their sampled sequences based on their log likelihoods in the

    PGM.

    The approaches described above are summarized in Table 1. This sum-

    mary demonstrates that it is possible to characterize peptide design ap-

    proaches based on their sequence generation method, scoring method, and

    chosen feature sets. Analyzing approaches in this way may help to differen-

    tiate the strengths and shortcomings of various approaches, and encourages

    a modular view of the problem. It may be possible in the future to combine

    candidate generation methods with scoring methods based on their results

  • 7/27/2019 whelanch_rpe

    6/23

    DESIGNING PEPTIDES WITH WEIGHTED FINITE-STATE TRANSDUCERS 6

    Target Generation Method Scoring Method Feature SetInorganic binding [6] Random Alignment score SequenceAMPs [7] AA distribution ANN QSAR

    AMPs [8] AA distribution ANN QSARAMPs [9] Regular grammars Grammar Matching SequenceWW Domains [10] Simulated Annealing Residue Correlation MSA couplingWW Domains [11] Sampling of PGM PGM Log likelihood MSA coupling

    Table 1. Selection of recent approaches for computational pep-tide design. Abbreviations are AMP: Antimicrobial peptide; AA:amino acid; ANN: Artificial neural networks; QSAR: Quantitativestructure-activity relationship; MSA: Multiple Sequence Align-ment; PGM: Probabilistic graphical model.

    in similar peptide application domains. In the next section, we describe our

    approach, which features a feature set and scoring method designed to take

    advantage of recent machine learning techniques for biological sequences,

    and incorporates a novel generation method that we believe will be more

    efficient at suggesting new peptide sequence candidates than the methods

    described above.

    2.3. Overview of Proposed Solution. We propose a method for de novo

    peptide design based on learning from n-gram counts of classes of amino

    acid residues, and then using weighted finite-state transducers to produce

    sequences that include those features that are strongly associated with the

    desired class of peptides. Feature mappings based on n-gram counts are

    analogous to representations of sequences in the frequency domain [12], and

    have been used successfully in tasks such as protein remote homology de-

    tection [13], the problem of detecting similarities between proteins from

    different organisms. In our application, we attempt to learn a set of weights

    that describe how each n-gram feature is associated with the target peptide

    class, and then use those weights to generate new sequences. The latter task

    is made difficult, however, by the fact that most vectors ofn-gram feature

    counts do not represent valid peptide sequences. For example, a feature vec-

    tor that has a positive count for a trigram feature must also have positive

  • 7/27/2019 whelanch_rpe

    7/23

    DESIGNING PEPTIDES WITH WEIGHTED FINITE-STATE TRANSDUCERS 7

    counts for the bigrams and unigrams contained within the trigram; other-

    wise, it cannot represent a valid sequence. If the weight associated with the

    trigram feature is high, but the weight associated with the bigram features

    is low or negative, one cannot simply increase the count of the trigram fea-

    ture while decreasing the count of the bigram features. Weighted finite-state

    transducers (WFSTs) are a potential solution for this problem in sequence

    design. Commonly used in speech recognition and natural language process-

    ing [14, 15], they can be used to build a weighted lattice of sequences that

    can be searched or sampled using efficient algorithms to yield sequences rich

    in a desired set ofn-gram features, while still being valid sequences.

    3. METHODS

    3.1. Construction of Inorganic Binding Peptides Data Set. In our

    initial training phase, we used the 39 training examples from Oren et al.

    [6] to train our scoring model. All sequences were of length 12. Oren et

    al. characterized these sequences according to their quartz binding affinity

    as either strong (10 sequences), moderate (14 sequences), or weak binders(15 sequences). We used the 10 strong binders as positive training examples

    and the 29 moderate and weak binders as negative training examples. We

    initially trained our model using these 39 examples, and then tested the

    scoring model on the 10 novel peptides generated by Oren et al. Because

    the data set was small, for the sequence generation phase of our approach

    we retrained on original training set of 39 plus the two strongest and two

    weakest of the 10 novel sequences, for a total of 43 sequences: 12 positive

    examples and 31 negative examples.

    3.2. Construction of Antimicrobial Peptides Data Set. We down-

    loaded all experimentally verified AMPs from the CAMP database as of

    March 8, 2010, yielding a set of 1,187 peptide sequences. After removing all

  • 7/27/2019 whelanch_rpe

    8/23

    DESIGNING PEPTIDES WITH WEIGHTED FINITE-STATE TRANSDUCERS 8

    ( )

    Figure 1. An example of how we map peptide sequences to fea-tures. The substring DWP contributes to the count of the trigramfeaturef3, which represents subsequences of classes Acidic, Aro-matic, Cyclic.

    sequences that contained the nonstandard amino acid letters B, X, and Z,

    and then extracting representative sequences using the CD-HIT clustering

    server [16] with a sequence identity parameter of 0.9, we were left with a data

    set of 862 AMP sequences, with a mean length of 34 amino acid residues

    and a median length of 30. It is difficult to create a negative training set

    of experimentally verified non-AMPs, so we followed Thomas et al. [5] in

    noting that AMPs are generally secreted from cells, and downloaded a set

    of human protein sequences from the UniProt database that were between

    twenty and fifty amino acids in length, not annotated as antimicrobial, and

    not annotated as secreted. This gave us a set of 1,224 negative training

    examples. We randomly split the data, putting 70% in a training set and

    30% in a test set.

    3.3. Extracting Features and Training an SVM Classifier. To build

    a feature space and classifier for our peptide sequences, we define a set

    of 13 classes to represent the chemical properties of amino acid residues:

    acidic, cyclic, aliphatic, aromatic, basic, buried, charged, hydrophobic, large,

    medium, small, non-neutral, and polar. We define unigram, bigram, and

    trigram features to be the ordered set of classes contained in a subsequence

    of one, two, or three amino acids. For each peptide sequence, we count the

    number of times each feature appears, as shown in Figure 1.

    Given a set of positive and negative training examples, we compute vectors

    of feature counts and train an SVM using the SVMLight package [17] V6.02.

  • 7/27/2019 whelanch_rpe

    9/23

    DESIGNING PEPTIDES WITH WEIGHTED FINITE-STATE TRANSDUCERS 9

    (a) Feature Machine

    0

    f1/-4.401e-05

    f2/-0.00479

    f3/0.00515

    f4/-0.00492

    f5/0.00492

    (b) Scorer

    Figure 2. Portions of the finite-state machines that generate newsequences. 2(a): A portion of the finite-state transducer that com-putes the list of features contained in a sequence, showing pathsthat can be taken from the state that represents a trigram historyof CW. In one path the machine accepts the amino acid V asinput, and emits the features f3,f6, and f8, before proceeding tothe state that indicates that the history is is now WV. On theother path the machine accepts the amino acid R, and emits thefeaturesf5,f7, andf9. The symbol represents the empty string.2(b): A finite-state acceptor that assigns scores to features.

    Training an SVM produces a linear classifier in the feature space, defined

    by

    wT(x) + b,

    where (x) is the mapping of a sequence to the feature space and w is a

    weight vector that describes the decision boundary hyperplane in the feature

    space. It has been shown that the distance from a point in the feature space

    to the separating hyperplane can be mapped by a sigmoid function to an

    estimate of the probability that the point will belong to the positive or

    negative class [18]. Therefore, from our trained SVM we extract w, which

    indicates the direction in the frequency feature space that we hypothesize

    contain sequences that are more likely to be positive examples.

    3.4. Building a WFST to Generate New Sequences. A weighted

    finite-state transducer is an automaton in which each transition between

    states is associated with an input symbol, an output symbol, and a weight.

  • 7/27/2019 whelanch_rpe

    10/23

    DESIGNING PEPTIDES WITH WEIGHTED FINITE-STATE TRANSDUCERS 10

    Formally, they are defined as 8-tuples (, , Q , I , F , E , , ), where is an

    input alphabet, is an output alphabet, Q is a finite set of states, I is the

    set of initial states, F is the set of final states, E is the set of transitions

    between states, is a weight function for initial states, and is a weight

    function for final states. The input and output alphabets are augmented

    with a symbol which represents the empty string, allowing transitions to

    not input or output a symbol. If no outputs are associated with the tran-

    sitions then the machine is referred to as a weighted finite-state acceptor

    (WFSA). Examples of WFSTs and WFSAs can be seen in Figure 2. An

    important operation on WFSTs is composition, denoted by . In compo-

    sition, the outputs of one WFST are fed to the inputs of a second WFST

    or WFSA. Efficient implementations of composition allow the construction

    of complex models based on a set of simple machines [19].

    We use WFSTs to generate novel peptide sequences that will score well ac-

    cording to the weight vector learned by our classifier. To do so, we compose

    together three finite-state machines to build a WFST capable of generating

    our desired sequences. Our first machine, F, is an unweighted transducer

    that maps from a sequence of amino acids to a sequence of tokens represent-

    ing features, as shown in Figure 2(a). Our second machine,S, is a WFSA

    that provides a score for a given sequence of features. This machine is built

    using the weight vector from the SVM classifier (Figure 2(b)). Each arc

    accepts a single feature token fiwith a weight equal to wi, where wiis the

    value of the classifier weight vector for that feature and is a parameter set

    when building the model. After applying a weight pushing algorithm and

    normalizing, the weights within the machine are treated as log probabili-

    ties; therefore, the parameter can be used to vary the peakedness of the

    probability distribution over generated sequences because wi

    is a factor in

    the probability of a sequence. The third machine, T, is a simple transducer

  • 7/27/2019 whelanch_rpe

    11/23

    DESIGNING PEPTIDES WITH WEIGHTED FINITE-STATE TRANSDUCERS 11

    that accepts and outputs any sequence of length m; this machine constrains

    the length of our generated sequences. We use a value for m of 12 for the

    inorganic binding peptide problem, since 12 a standard length for designed

    inorganic binding peptides. For AMPs, we set m to 30, the median length

    of the AMP positive training set.

    We build our final machine by composing the three sub-machines:

    T F S

    We then determinize the paths through this transducer and normalize the

    scores of each path. This produces a WFST that accepts amino acid se-

    quences of lengthmwith a score given by summing up the individual weights

    of every feature contained within the sequence times . We build our finite-

    state machines using the open source OpenFST library, version 1.1 [20].

    In addition to implementing the algorithms needed to produce the trans-

    ducer as described above, OpenFST provides tools that can search for the

    highest scoring sequences accepted by the machine, and can sample from

    high-scoring sequences probabilistically, by treating the scores of each tran-

    sition within the machine as a negative log probability. Random sampling

    adds diversity to our results, which is desireable because the highest scor-

    ing sequences generated by the model are often permutations of the same

    motifs.

    3.5. Addition of a Language Model for Inorganic Binding Peptides.

    Because of the small size of the inorganic binding peptide training set, we

    wanted to ensure that our model did not overfit and could generalize to

    produce novel chemically and biologically plausible sequences. To do so,

    we build a machine L, which models the probabilities of peptide sequences

    based on published sequences that are known to have some binding function.

  • 7/27/2019 whelanch_rpe

    12/23

    DESIGNING PEPTIDES WITH WEIGHTED FINITE-STATE TRANSDUCERS 12

    This machine is a finite-state language model for amino acid sequences,

    of the type commonly used in speech recognition and language processing

    [15]. We create a training set for L by downloading all protein sequences

    from the Gene Ontology database [21] that are annotated with the term

    binding. Our current language model training set was downloaded from

    the 2010-01-10 seqdb release of the Gene Ontology database, which contains

    187,269 sequences that are directly or indirectly associated with the term

    binding. By extracting uniquely named sequences from this release, we

    produced a training set of 64,894 unique protein sequences that we used to

    train our model. To build the language model we used a set of functions

    built on the OpenFST toolkit. The language model was built using an n-

    gram order of 4, so that the probability of an amino acid appearing in the

    sequence is conditional on the previous three amino acids. We use Witten-

    Bell smoothing to estimate the probabilities of sequences of amino acids that

    do not appear in the training data. Finally, we rescale the language model

    probabilities using an additional parameter , by which we multiply each

    transition weight inL. Much like the parameter defined above, controls

    the peakedness of the probability distribution defined by the language

    model; we use it to increase or decrease the effect of the language model on

    the sequences generated by our system. In this case, we build our complete

    machine by composing the four sub-machines:

    L T F S

    This produces a weighted finite-state acceptor that accepts sequences of

    12 amino acids with a score given by summing up the individual weights

    of every feature contained within the sequence times . The probability of

    generating a sequence is affected by both the score of the sequence in the

  • 7/27/2019 whelanch_rpe

    13/23

    DESIGNING PEPTIDES WITH WEIGHTED FINITE-STATE TRANSDUCERS 13

    SVM classifier and the probability of the sequence according to the language

    model.

    3.6. Parametrization of the Model and Generation of New Se-

    quences. Because of the different validation methods available for the prob-

    lems of inorganic binding peptide and AMP design, we used different ap-

    proaches to set parameters and generate sequences for testing. For inorganic

    binding peptides, we had a very small training set and had to select a small

    number of designed sequences for experimental testing; therefore, we first

    chose the parameters for our model based on observations of the sequences

    generated by a variety of parameter combinations. We then selected the

    highest scoring candidate peptides from a model built with the best combi-

    nation of parameters. The existence of third-party computational prediction

    servers for AMPs, on the other hand, allowed us to generate and test large

    batches of sequences simultaneously, albeit without the certainty of exper-

    imental verification. Therefore, we generated large groups of sequences for

    several models, as well as groups of control sequences for comparison.

    3.6.1. Selection of Inorganic Binding Peptide Sequences. In addition to the

    language model described in Section 3.5, we had two additional strategies for

    helping our model to generalize even with the small training set of inorganic

    binding peptides. First, we chose parameters which maximized the diversity

    of sequences generated as well as their scores within the model, so that we

    would be more likely to have a successful result among our selected set.

    Additionally, we used a distance constraint to filter out sequences that were

    too different from the known positive examples.

    To choose parameters, we sampled sets of 2000 sequences for each value

    of the model peakedness parameter in [1, 2, 3, 5, 6, 7, 8, 10, 11], and the

  • 7/27/2019 whelanch_rpe

    14/23

    DESIGNING PEPTIDES WITH WEIGHTED FINITE-STATE TRANSDUCERS 14

    language model strength parameter in [0.75, 1, 1.5, 2, 3]. For each combi-

    nation of parameters, we computed the average score for all 2000 sequences

    in the SVM classifier, as well as the average score given by the QUARTZ2

    matrix designed by Oren et al. [6]. Greater values of and produce se-

    quences with higher scores in both scoring systems, as do smaller values of

    . However, the diversity of the sequences created decreases in both cases.

    Based on a subjective analysis of the scores and diversity of the sequences

    created, we chose a value for of 3 and a value for of 0.75; this set of

    parameters produced consistently high scoring sequences while maintaining

    population diversity in the generated set. POPDIV [22] is an algorithm for

    computing the population diversity of a sample of peptide sequences. Our

    final parameter choice yielded a set of sequences with high scores but sig-

    nificant population diversity as measured by POPDIV; with larger values of

    , population diversity drops quickly to insignificant levels.

    To select our final candidates for testing, we added a final filtering step to

    our process that ensured that our candidate sequences were similar to the

    positive training examples. We first generated 100,000 random sequences.

    For each sequence, we calculated the euclidean distance between it and the

    centroid of the set of positive training examples in the feature space. In

    other words, we calculated the distance d for a candidate sequence x as:

    d(x) =

    (x) |P|i=1 (Pi)|P|

    2,

    wherePis the set of positive training examples and is the mapping from

    a sequence to the n-gram feature space. By discarding sequences for which

    d(x) > for some , we constrained our search to find sequences that lie

    within a region centered on the average positive training example. Based on

    comparisons of the scores and diversities of sequences at various distances,

  • 7/27/2019 whelanch_rpe

    15/23

    DESIGNING PEPTIDES WITH WEIGHTED FINITE-STATE TRANSDUCERS 15

    we chose a value for of 40. We discarded all sequences with distances

    greater than 40, and then chose the 50 remaining sequences with the highest

    scores in our SVM classifier as candidates for experimental validation. To

    further test the properties of our system, we also generated sequences with a

    value of -1. This inverts the weights and should produce peptide sequences

    that have little to no inorganic binding affinity. To produce this set of

    predicted weak binders, we chose the 20 lowest scoring sequences from a

    sample of 100,000. We also included the lowest possible scoring sequence in

    the model, found using a best path search of the WFST: AIRGIRGIRGIR.

    3.6.2. Creation of Sets of Antimicrobial Peptide Sequences. We created five

    sets of novel AMP sequences for testing, each of which consisted of 2,000

    novel peptide sequences of length 30. The first, RAND, was a control set,

    consisting of amino acid residues chosen randomly at each position. As a sec-

    ond control, we also created a set AADIST, which was generated by setting

    each amino acid independently based on the distribution of amino acids in

    the CAMP database. We then sampled from our WFST to build three sets

    WFST1, WFST2, and WFSTNEG, with the peakedness parameter set to

    1, 2, and -1, respectively. The group WFSTNEG exists to demonstrate the

    ability of the method to solve the inverse design problem: creating sequences

    which are unlikely to be AMPs. For each of our WFST generated groups,

    we verified that no sequences shared more than 0.9 sequence identity using

    the CD-HIT clustering web service [16].

    3.7. Experimental Validation of Inorganic Binding Peptides. From

    our final set of candidate inorganic binding peptides, our collaborators in

    the GEMSEC group at the University of Washington have selected several

    peptides for synthesis and experimental validation. Their choices are based

  • 7/27/2019 whelanch_rpe

    16/23

    DESIGNING PEPTIDES WITH WEIGHTED FINITE-STATE TRANSDUCERS 16

    on factors such as difficulty of synthesis and expected yield, properties of

    sequences that could not be learned computationally from the data available.

    3.8. Third-party Prediction Servers for AMP Sequences. To eval-

    uate the novel AMP sequences produced by our system, we used the pre-

    diction server provided by the CAMP database website [5]. Although it is

    not a substitute for experimental validation, using a computational predic-

    tion technique allows rapid testing of large sets of peptides, and is therefore

    useful in validating our approach. The CAMP server classifies peptide se-

    quences as AMPs using three methods: support vector machines, random

    forests (RF), and discriminant analysis (DA). Thomas et al. showed that

    these classifiers perform well on a test data set, with overall accuracies of

    91.5%, 93.2%, and 87.5%, respectively. The CAMP predictors use a fea-

    ture set composed of a variety of features including amino acid composition,

    average hydrophobicity and hydrophilicity of the peptide, transition and

    composition of groups based on reduced amino acid alphabets, and di- and

    tri-peptide composition based on hydrophobicity. Therefore, there is partial

    but not complete overlap with the feature set used in our SVM classifier

    and WFSTs. Even though this overlap may make it easier for our method

    to produce sequences that CAMP classifies as AMPs, we believe that the

    construction of sequences that score highly in a third-party classification

    system is a valuable demonstration of our approach.

    4. RESULTS

    4.1. Performance of SVM Classifiers. We began evaluating our system

    by testing the SVM classifiers we trained for each peptide design problem

    on a held out test set. Because we used the feature weights from the SVM

    to build our WFSTs, our system depends on the SVMs ability to correctly

    classify unseen sequences as positive or negative examples.

  • 7/27/2019 whelanch_rpe

    17/23

    DESIGNING PEPTIDES WITH WEIGHTED FINITE-STATE TRANSDUCERS 17

    0.4 0.2 0.0 0.2 0.4

    0.0

    0.5

    1.0

    1.5

    2.0

    2.5

    Distance from hyperplane

    Dippositionshift(nm)

    Predicted Strong BindersPredicted Weak Binders

    Figure 3. Experimental binding scores of 10 test sequences fromOren et al. [6]. The sequences are arranged by their distancefrom the hyperplane score in the SVM classifier. The Y axis showsthe Dip position shift after six minutes as measured by Oren etal., a measurement of binding ability. Sequences that Oren etal. predicted to be strong binders are light lines; predicted weak

    binders are dark.

    4.1.1. Inorganic Binding Peptides. When trained on the 39 training exam-

    ples from Oren et al. [6], our classifier successfully predicts strong and weak

    binders when evaluated against the 10 novel peptides reported in that paper

    (Figure 3). In addition, the distance from the hyperplane predicts a portion

    of the observed binding ability (R2 = 0.52, p = 0.02). Although this is a

    small data set, this strong performance is encouraging.

    4.1.2. Antimicrobial Peptides. We evaluated the SVM classifier from the

    AMP model against our held-out training set of 259 positive and 367 neg-

    ative examples. Our classifier had an area under the ROC curve of 0.931.

    This indicates that our feature set of n-gram counts of amino acid classes

  • 7/27/2019 whelanch_rpe

    18/23

    DESIGNING PEPTIDES WITH WEIGHTED FINITE-STATE TRANSDUCERS 18

    NumberofPositivePredictions

    0

    500

    1000

    1500

    2000

    RAND AADIST WFST1 WFST2 WFSTNEG

    ClassifierSVM

    RF

    DA

    Figure 4. Number of sequences in a sample of 2000 predictedto be AMPs for different sequence generation methods. A valueof 2000 indicates that all of the generated sequences were pre-dicted to be AMPs; a score of 0 indicates that none were predictedto be AMPs. RAND: Sequences generated completely randomly.AADIST: Sequences generated according to the distribution of

    amino acids in the CAMP database. WFST1: Sequences gener-ated using a WFST with a peakedness parameter = 1. WFST2:Sequences generated using a WFST with = 2. WFSTNEG: Se-quences generated using a WFST with = -1. Results are shownfor all three prediction methods available at the CAMP databaseserver: SVM: support vector machine. RF: Random Forest. DA:Discriminant Analysis.

    provides sufficient information to train an SVM classifier with strong per-

    formance in the AMP problem domain.

    4.2. Validation of Generated Sequences. Our inorganic binding se-

    quences are being tested experimentally by the GEMSEC group at the Uni-

    versity of Washington at the time of this writing. However, we have been

    able to test our AMP sequences using computational prediction servers.

  • 7/27/2019 whelanch_rpe

    19/23

    DESIGNING PEPTIDES WITH WEIGHTED FINITE-STATE TRANSDUCERS 19

    Feature WeightTryptophan 0.027

    Buried-Buried-Cyclic 0.022Buried-Cyclic-Aromatic 0.019Medium-Aromatic-Aliphatic 0.019Aliphatic-Medium-Aromatic 0.018Hydrophobic-Polar-Medium -0.018Polar-Small-Buried -0.020Hydrophobic-Medium-Aliphatic -0.021

    Table 2. Highest and lowest scoring features in the inorganicbinding model

    4.2.1. Antimicrobial Peptides. We used the CAMP prediction servers to de-

    termine the number of sequences predicted to be AMPs in our control and

    test groups. The results are shown in Figure 4. Averaged across the three

    prediction measures, only 3.9% of the random control group were predicted

    to be AMPs, while 59.6% of the AADIST control set had positive predic-

    tions. In the WFST1 group, an average of 84.9% were predicted to be

    AMPs. In the group created with a more peaked probability distribution

    over sequences, WFST2, an average of 99.9% of the generated sequences

    were predicted to be AMPs. Finally, in the WFSTNEG group, in which the

    value of the weights was reversed to reward non-AMP features, only 0.48%

    of the generated sequences were predicted to be AMPs.

    4.3. Examining Highly Weighted Features in the Models. To better

    understand the predictions made by our model, we extracted the features

    with the highest and lowest weights for both peptide design domains.

    4.3.1. Inorganic Binding Peptides. Table 2 shows the five highest and three

    lowest weighted features in the inorganic binding peptide model. The impor-

    tance of tryptophan in our model agrees with the results of Oren et al. [6].

    An analysis of the trigrams with strong weights in the model may help us to

    better understand the chemical structures necessary for inorganic binding.

  • 7/27/2019 whelanch_rpe

    20/23

    DESIGNING PEPTIDES WITH WEIGHTED FINITE-STATE TRANSDUCERS 20

    Feature WeightCysteine 0.428

    Isoleucine 0.279Lysine 0.246Aromatic-Buried-Small 0.169Medium-Medium-Aromatic 0.159Threonine -0.323Leucine -0.356Serine -0.380

    Table 3. Highest and lowest scoring features in the AMP model

    4.3.2. Antimicrobial Peptides. The five highest and three lowest weighted

    features for the AMP model are shown in Table 3. As with the inor-

    ganic binding peptides, our model identified unigram features of amino acid

    residues that are overrepresented in AMP sequences; for example, the im-

    portance of cysteine residues in AMPs has been widely studied [23, 24].

    Analysis of the trigram and bigram features identified by the model may

    yield insights into AMP structure and design.

    5. DISCUSSION

    5.1. Conclusions. We have shown that by using the n-gram features of

    chemical classes of amino acid residues and a trained SVM classifier, we can

    produce WFSTs that are capable of generating novel sequences which share

    the same features as the training set. We have applied our solution to two

    problems in peptide design. The SVM classifier component of our system

    performs well on known data sets of positive and negative examples in both

    problem domains. We have produced a set of inorganic binding peptide se-

    quences that we are testing experimentally. We have also generated sets of

    novel AMP sequences, and a third-party classification server predicts that

    a large proportion of these novel sequences will have antimicrobial proper-

    ties. By varying the parameters used to construct our machines, we can

  • 7/27/2019 whelanch_rpe

    21/23

    DESIGNING PEPTIDES WITH WEIGHTED FINITE-STATE TRANSDUCERS 21

    exchange diversity of the generated sequences for a higher likelihood of gen-

    erating novel peptides in a desired class. We believe that this framework is

    a promising approach for novel peptide design. By applying our system to

    both inorganic binding and AMP sequence design, we have shown that it

    may be generalizable to many problems in peptide design.

    5.2. Future Work. Although we believe that the use of third-party com-

    putational prediction servers for AMPs shows that this is a promising ap-

    proach, any attempt to computationally design novel AMPs must eventually

    be validated by synthesizing and testing actual peptides. We will look to do

    so in the future.

    We believe that this framework is generally applicable to many problems

    in peptide design. We originally designed the system for the inorganic bind-

    ing peptide design problem, but found that we could easily adapt it to the

    design of AMPs with minimal changes. Therefore, we are optimistic that it

    could be easily generalized to other applications. One such problem is the

    design of peptides capable of binding to larger proteins, such as G-coupled

    protein receptors, which is an application with many implications in drug

    design.

    We would also like to study the impacts of using different feature map-

    pings for sequences on the performance of the system. Our feature mapping

    is closely related to string kernel methods such as the spectrum kernel [13]

    and the mismatch kernel [25], and could easily be extended to incorporate

    features of other kernels, like the gappy and wildcard [26]. In fact, because

    our approach relates string kernels and weighted finite-state transducers, it

    should be possible to incorporate any rational kernel [27], a class of kernels

    defined in terms of WFSTs that includes most string kernels currently in

    use in computational biology.

  • 7/27/2019 whelanch_rpe

    22/23

    DESIGNING PEPTIDES WITH WEIGHTED FINITE-STATE TRANSDUCERS 22

    6. ACKNOWLEDGEMENTS

    Kemal Sonmez and Brian Roark provided the initial ideas for this project,and contributed invaluable advice during the course of its implementa-

    tion. Kemal Sonmez provided the initial implementation of the peptide

    sequence to feature mapping code. Brian Roark provided tools for building

    the UniProt language model. I am thankful for the willingness of Mehmet

    Sarikaya, Candan Tamerler, and the GEMSEC group at the University of

    Washington to test the novel inorganic binding sequences. I also thank the

    NSF for supporting part of this work through Grant #IIS-0811745.

    References

    [1] C. Tamerler and M. Sarikaya, Molecular biomimetics: nanotechnology and bionan-otechnology using genetically engineered peptides, Philos Transact A Math PhysEng Sci, vol. 367, pp. 170526, May 2009.

    [2] R. H. Hoess, Protein design and phage display, Chem Rev, vol. 101, pp. 320518,Oct 2001.

    [3] H. Jenssen, P. Hamill, and R. E. W. Hancock, Peptide antimicrobial agents, ClinMicrobiol Rev, vol. 19, pp. 491511, Jul 2006.

    [4] G. Wang, X. Li, and Z. Wang, APD2: the updated antimicrobial peptide databaseand its application in peptide design, Nucleic Acids Research, vol. 37, no. Databaseissue, p. D933, 2009.

    [5] S. Thomas, S. Karnik, R. S. Barai, V. K. Jayaraman, and S. Idicula-Thomas, CAMP:

    a useful resource for research on antimicrobial peptides, Nucleic Acids Research,vol. 38, pp. D77480, Jan 2010.

    [6] E. E. Oren, C. Tamerler, D. Sahin, M. Hnilova, U. O. S. Seker, M. Sarikaya, andR. Samudrala, A novel knowledge-based approach to design inorganic-binding pep-tides,Bioinformatics, vol. 23, pp. 281622, Nov 2007.

    [7] A. Cherkasov, K. Hilpert, H. Jenssen, C. Fjell, M. Waldbrook, S. Mullaly, R. Volkmer,and R. Hancock, Use of artificial intelligence in the design of small peptide antibioticseffective against a broad spectrum of highly antibiotic-resistant superbugs, ACSChem Biol, vol. 4, no. 1, pp. 6574, 2008.

    [8] H. Jenssen, C. D. Fjell, A. Cherkasov, and R. E. W. Hancock, QSAR modeling andcomputer-aided design of antimicrobial peptides, J Pept Sci, vol. 14, pp. 1104, Jan2008.

    [9] C. Loose, K. Jensen, I. Rigoutsos, and G. Stephanopoulos, A linguistic model forthe rational design of antimicrobial peptides, Nature, vol. 443, pp. 8679, Oct 2006.

    [10] M. Socolich, S. Lockless, W. Russ, and H. Lee, Evolutionary information for speci-fying a protein fold, Nature, vol. 437, pp. 5128, Jan 2005.

    [11] J. Thomas, N. Ramakrishnan, and C. Bailey-Kellogg, Protein design by sampling anundirected graphical model of residue constraints, IEEE/ACM Trans Comput BiolBioinform, vol. 6, pp. 50616, Jan 2009.

    [12] D. Anastassiou, Frequency-domain analysis of biomolecular sequences, Bioinfor-matics, vol. 16, pp. 107381, Dec 2000.

  • 7/27/2019 whelanch_rpe

    23/23

    DESIGNING PEPTIDES WITH WEIGHTED FINITE-STATE TRANSDUCERS 23

    [13] C. Leslie, E. Eskin, and W. S. Noble, The spectrum kernel: a string kernel for SVMprotein classification,Pac Symp Biocomput, pp. 56475, Jan 2002.

    [14] B. Roark and R. Sproat,Computational Approaches to Morphology and Syntax. Ox-

    ford Surveys in Syntax & Morphology, Oxford: Oxford University Press, 2007.[15] M. Mohri, F. Pereira, and M. Riley, Weighted finite-state transducers in speech

    recognition,Comput Speech Lang, vol. 16, pp. 6988, Jan 2002.[16] Y. Huang, B. Niu, Y. Gao, L. Fu, and W. Li, CD-HIT suite: a web server for

    clustering and comparing biological sequences, Bioinformatics, vol. 26, pp. 6802,Mar 2010.

    [17] T. Joachims, Making large-scale SVM learning practical, in Advances in KernelMethods - Support Vector Learning (B. Schlkopf, C. Burges, and A. Smola, eds.),ch. 11, Cambridge, MA: MIT Press, 1999.

    [18] J. C. Platt, Probabilistic outputs for support vector machines and comparisons toregularized likelihood methods, inAdvances in Large Margin Classifiers, pp. 6174,MIT Press, 1999.

    [19] F. C. N. Pereira and M. Riley,Finite-State Devices for Natural Language Processing.Cambridge, Massachusetts: MIT Press, 1997.

    [20] C. Allauzen, M. Riley, J. Schalkwyk, W. Skut, and M. Mohri, OpenFst: A generaland efficient weighted finite-state transducer library, in Proceedings of the NinthInternational Conference on Implementation and Application of Automata, (CIAA2007), vol. 4783 of Lecture Notes in Computer Science, pp. 1123, Springer, 2007.http://www.openfst.org.

    [21] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P.Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald, G. M.Rubin, and G. Sherlock, Gene ontology: tool for the unification of biology. the GeneOntology Consortium, Nat Genet, vol. 25, pp. 259, May 2000.

    [22] L. Makowski and A. Soares, Estimating the diversity of p eptide populations fromlimited sequence data, Bioinformatics, vol. 19, pp. 4839, Mar 2003.

    [23] J. Dimarcq, P. Bulet, C. Hetru, and J. Hoffmann, Cysteine-rich antimicrobial pep-tides in invertebrates, Peptide Science, vol. 47, no. 6, pp. 465477, 1999.

    [24] T. Ganz, Defensins: antimicrobial peptides of innate immunity,Nat Rev Immunol,vol. 3, pp. 71020, Sep 2003.

    [25] C. Leslie, E. Eskin, J. Weston, and W. Noble, Mismatch string kernels for SVMprotein classification,Advances in Neural Information Processing Systems, pp. 14411448, 2003.

    [26] C. Leslie and R. Kuang, Fast string kernels using inexact matching for proteinsequences,The Journal of Machine Learning Research, vol. 5, Dec 2004.

    [27] C. Cortes, P. Haffner, and M. Mohri, Rational kernels: Theory and algorithms,The Journal of Machine Learning, vol. 5, pp. 10351062, Aug 2004.