PepPat, a pattern-based oligopeptide homology search method and the identification of a novel...


Page 1: PepPat, a pattern-based oligopeptide homology search method and the identification of a novel tachykinin-like peptide

PepPat, a pattern-based oligopeptide homology search method and the identification of a novel tachykinin-like peptide

Ying Jiang,1,2 Ge Gao,2 Gang Fang,2 Eric L. Gustafson,1 Maureen Laverty,1

Yanbin Yin,2 Yong Zhang,2 Jingchu Luo,2 Jonathan R. Greene,1 Marvin L. Bayne,1

Joseph A. Hedrick,1 Nicholas J. Murgolo1

1 Bioinformatics/Genomics Research, Schering-Plough Research Institute, 2015 Galloping Hill Road, Kenilworth, NJ 07033, USA
2 Centre of Bioinformatics, College of Life Sciences, Peking University, Beijing 100871, P.R. China

Received: 2 November 2002 / Accepted: 22 January 2003

Abstract

PepPat, a hybrid method that combines pattern matching with similarity scoring, is described. We also report PepPat’s application in the identification of a novel tachykinin-like peptide. PepPat takes as input a query peptide and a user-specified regular expression pattern within the peptide. It first performs a database pattern match and then ranks candidates on the basis of their similarity to the query peptide. PepPat calculates similarity over the pattern-spanning region, enhancing PepPat’s sensitivity for short query peptides. PepPat can also search for a user-specified number of occurrences of a repeated pattern within the target sequence. We illustrate PepPat’s application in short peptide ligand mining. As a validation example, we report the identification of a novel tachykinin-like peptide, C14TKL-1, and show it is an NK1 (neurokinin receptor 1) agonist whose message is widely expressed in human periphery. Availability: PepPat is offered online at: http://peppat.cbi.pku.edu.cn

Introduction

Sequence similarity database searches not only provide functional information through annotation transfer, but also allow the identification of novel homologous genes or proteins. As more sequence data have become available, sequence database searches have become commonplace for biologists. Many advanced algorithms have been developed and are available for protein database sequence similarity searches, and they play pivotal roles in protein functional annotation or sequence database mining (reviewed by Mount 2001). BLASTP (Altschul et al. 1990) and FASTA (Pearson and Lipman 1988; Pearson 2000) are popular heuristic local alignment algorithms, which use a single query sequence. The PSI-BLAST (Altschul et al. 1997) algorithm, which also uses a single query sequence, applies BLAST iteratively to produce a profile-based database search. Compared with BLAST, PSI-BLAST has greatly increased sensitivity and is capable of identifying remote homologs.

There are also many motif search algorithms, which can search protein sequence databases for proteins that contain a functional motif. These algorithms, such as PROSITE (Falquet et al. 2002), HMMER (Durbin et al. 1998), Reverse PSI-BLAST (Wheeler et al. 2002), IMPALA (Schaffer et al. 1999), and EMOTIF (Huang and Brutlag 2001), can accept input functional patterns which are typically represented as regular expressions, scoring matrices, or profiles. They also require a database of preexisting motifs or profiles such as InterPro (Apweiler et al. 2001), PFAM (Bateman et al. 2002), SMART (Letunic et al. 2002), PRINTS (Attwood et al. 2002), PRODOM (Corpet et al. 2000), BLOCKS (Henikoff and Greene 2000), CDD (Marchler-Bauer et al. 2002), and DOMO (Gracy and Argos 1998).

Similar to those algorithms, there are also regular pattern-matching methods which search for the existence of a query pattern in a sequence database, such as the GCG FindPatterns program (Accelrys, Madison, Wis.) and the general pattern-matching package ANREP (Mehldau and Myers 1993). There is also an algorithm that does general pattern matching in DNA or RNA (Pesole et al. 2000). Pattern matching allows the comparison of two pattern-containing regions to establish the functional similarities between two sequences, but does not provide the sophistication of statistical analysis or similarity scoring present in profile-based motif search algorithms. However, for short peptide mining, besides the new version of BLAST, pattern matching could be particularly useful if a conserved functional pattern could be defined and not enough information were available to generate a profile. Moreover, pattern matching is simple and often allows the input of a user-defined pattern.

DOI: 10.1007/s00335-002-3061-y • Volume 14, 341–349 (2003) • © Springer-Verlag New York, Inc. 2003

Correspondence to: Y. Jiang, E-mail: [email protected]

When performing regular expression pattern matching, it is possible that the pattern is necessary for but not definitive of function, i.e., the pattern specified is most likely not specific enough. Pattern matching may then produce too many pattern hits, and biologists may have to spend much time determining their relevance. If one can combine similarity scoring to a functionally relevant example of the pattern with a ‘soft’ (meaning not very specific) pattern matching, hits containing the query pattern can be ranked based on similarity to the example. This simple addition of similarity scoring could thus go a long way in helping the curation process, which is exactly one of the major concerns in database mining. This is best illustrated by PHI-BLAST (Pattern Hit Initiated-BLAST; Zhang et al. 1998), which is a rapid and sensitive method for pattern-based homology searches. PHI-BLAST divides a single query sequence into the pattern-spanning region S0 and the pattern-flanking regions S1 and S2. PHI-BLAST identifies pattern hits in the S0 region, then calculates similarity and performs statistical analysis on the pattern-flanking regions S1 and S2 to rank the hits. Therefore, PHI-BLAST requires significant similarity in regions flanking the pattern to perform well. If there is not much flanking similarity, or if the query peptide and the specified pattern are short, PHI-BLAST may fail to detect homologs. This is often the case in searching for homologs of short bioactive oligopeptides. For those cases, including similarity scoring in the pattern-spanning region is desired.
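The division of a query into S1, S0, and S2 can be illustrated with a short Python sketch. The function name `split_regions`, the use of Python's `re` module, and the longer example query are assumptions made for illustration only; this is not PHI-BLAST code. The sketch shows why a short peptide is a problem for flank-based scoring: when the pattern spans the whole query, both flanks are empty.

```python
import re

def split_regions(query, pattern):
    """Split a query into (S1, S0, S2) around the first pattern match.

    S0 is the pattern-spanning region; S1 and S2 are the flanking regions.
    Returns None when the pattern does not occur in the query.
    """
    match = re.search(pattern, query)
    if match is None:
        return None
    return query[:match.start()], match.group(0), query[match.end():]

# A 6mer whose pattern covers the entire peptide leaves no flanks to score.
print(split_regions("MGLGEM", ".{5}M"))

# A longer, made-up query keeps flanking residues on both sides.
print(split_regions("AAAFVGLMGKRCCC", ".{4}MG[KR][KR]"))
```

With empty S1 and S2 there is nothing left for a flank-based method to compare, which is the situation scoring inside the pattern-spanning region is meant to address.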

Complementary to PHI-BLAST (Zhang et al. 1998), to achieve the flexibility of a single query sequence with a specified functional pattern for short oligopeptide homolog identification, we developed PepPat, a hybrid method combining pattern matching and similarity scoring. PepPat considers similarity over the pattern-spanning region to permit flexibility in handling a short peptide pattern and to allow hits to be ranked according to their similarity to a query example peptide.

Input to the PepPat program consists of a peptide sequence, a user-defined functional pattern (with the option of having only a query pattern without a query peptide to perform pattern matching without similarity scoring), and a subject protein sequence database in multiple FASTA format. The functional pattern can be expressed as a Perl (Wall et al. 1996) regular expression, or as a PROSITE pattern in conjunction with the -p flag (a flag the program requires to specify patterns in PROSITE format). Using PepPat, one can also specify the number of times the query pattern occurs in subject sequences, which permits scanning for repeats, which are sometimes found in biological peptides.

PepPat was applied to peptide ligand mining as a proof of principle. We were able to identify a novel tachykinin-like peptide, C14TKL-1. Tachykinins are a family of peptide neurotransmitters that have important regulatory functions in a number of biological pathways. In the periphery they stimulate smooth muscle contraction, stimulate glandular secretion, induce activation of cells in the immune system, and activate peripheral nerves. In the CNS they regulate dopaminergic neurons in the basal ganglia and mesolimbic areas of the brain and are involved in the transmission of sensory information, including noxious stimuli, to the spinal cord. Tachykinins mediate such functions through binding to their receptors, NK1, NK2, and NK3, which are G protein-coupled receptors (reviewed by Watson and Arkinstall 1994). In this report, we also demonstrate that C14TKL-1 is an agonist for NK1, which is widely expressed in human periphery.

Materials and methods

PepPat development and testing.

Public data and software: Drosophila proteome sequences and the "non-redundant" or "NR" protein dataset were downloaded from NCBI. BLOSUM62 and PAM250 scoring matrices were also downloaded from NCBI by anonymous ftp. BLAST algorithms were from the NCBI toolbox and compiled locally.

Programming language and operating system for PepPat: PepPat was written in ANSI C and uses standard Unix-platform system calls. PepPat has been built and tested on Linux (RedHat distribution), Compaq DEC Alpha, Sun Solaris, and IBM AIX platforms. It can be implemented on standard desktop workstations and PCs as well.

Using PepPat: The query peptide and the pattern must be of the same length. Once compiled, one can call the executable to get brief help information as follows:

342 Y. JIANG ET AL: PEPPAT AND PATTERN SEARCH


PepPat v0.01beta R2 (build on 08:24:43, Jul 15 2002)
Usage: pp [-v] [-j] [-p] [-m Matrix] [-r MinRep] [-s MinSim] pep pat database

-r[MinRep] : set Min Replica
-v : verbose mode for debugging
-j : jump mode for speed
-p : ProSITE pattern mode
-m[file] : set scoring matrix file (default: BLOSUM62)
-s : set minimal similarity score (default is 3.33)

PepPat requires specification of three parameters:

• Pep: an oligopeptide sequence used as an example for similarity scoring.

• Pat: the functional motif or pattern, on the basis of which PepPat will search the sequence database for hits.

• Database: a standard multiple FASTA format sequence file as the subject database.

One can also do pattern scans without similarity scoring by omitting the pep parameter.

Peptide pattern specification: Patterns expressed as Perl (Wall et al. 1996) regular expressions are accepted.

1. A period, ".", is used to represent any amino acid.
2. ".{5}" means 5 positions of any amino acid.
3. "[ALY]" means one occurrence of A or L or Y.
4. ".{2,4}" means 2 to 4 positions where any residues are allowed (currently, PepPat does not calculate similarity scores when gaps are introduced into the alignment).

PROSITE pattern specification will be accepted with the -p option.
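The relationship between the two pattern notations can be sketched in Python. The converter below is a hypothetical helper, not PepPat's -p implementation, whose internals the text does not describe; it assumes the standard PROSITE conventions (x for any residue, parentheses for repeat counts, braces for excluded residues, hyphens as separators, < and > as terminus anchors).

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Perl/Python-style regex.

    Hypothetical helper for illustration; assumes standard PROSITE syntax.
    """
    p = pattern.strip().rstrip(".")           # drop optional trailing period
    p = p.replace("-", "")                    # element separators carry no meaning
    p = re.sub(r"\{([A-Z]+)\}", r"[^\1]", p)  # {ABC} = any residue except A, B, C
    p = p.replace("(", "{").replace(")", "}") # x(5) -> x{5}, x(2,4) -> x{2,4}
    p = p.replace("x", ".")                   # x = any amino acid
    p = p.replace("<", "^").replace(">", "$") # N- and C-terminal anchors
    return p

print(prosite_to_regex("x(5)M"))               # the pattern form used with PHI-BLAST
print(prosite_to_regex("x(4)-M-G-[KR]-[KR]"))  # a hyphenated PROSITE spelling
```

Converting everything to one regular expression dialect means a single matching routine can serve both input styles.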

C14TKL-1 identification, expression, and functional activity

Identification of C14TKL-1 (Chromosome 14 tachykinin-like peptide 1): A virtual translation of the Incyte Lifeseq EST database (Incyte Genomics, Inc., Palo Alto, Calif.) was generated by a proprietary six-frame translation program (not shown). For PepPat searching, Neurokinin A "FVGLMGKR" was used as the query peptide, and ".{4}MG[KR][KR]" was used as the pattern. An EST was identified containing C14TKL-1, and a corresponding HTG entry containing the sequence in human Chromosome 14 was identified by BLAST.
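Because the translation program used here is proprietary and not shown, the following Python sketch only illustrates the general idea of a six-frame translation followed by a pattern scan. All names are hypothetical, the standard genetic code is assumed, and the DNA string is a made-up back-translation of FYGLMGKR rather than the real EST.

```python
import re
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(dna):
    """Translate one reading frame; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)

def six_frames(dna):
    """Return translations of three forward and three reverse-strand frames."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    return [translate(strand[offset:])
            for strand in (dna, reverse) for offset in range(3)]

# Made-up sequence encoding FYGLMGKR in forward frame 0 (illustration only).
example_dna = "TTTTATGGTCTGATGGGTAAACGT"
for frame in six_frames(example_dna):
    if re.search(r".{4}MG[KR][KR]", frame):
        print("pattern hit in frame:", frame)
```

Scanning all six frames matters for EST data because neither the strand nor the reading frame of a fragment is known in advance.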

Quantitative PCR: Human tissue autopsy samples were purchased from Zoion Diagnostics (Shrewsbury, Mass.). Postmortem times for tissue collection ranged from 2 to 6 h. Total RNA was isolated from the tissues with TRI-reagent (MRC, Cincinnati, Ohio) and tested for quality and quantity with an Agilent 2100 Bioanalyzer (Waldbronn, Germany). Tissues from three donors were used in the analysis.

Quantitative PCR was carried out with an ABI Prism 7900HT Sequence Detection System (Applied Biosystems, Foster City, Calif.). Taqman primers and probes were designed with Primer Express software (ABI), and purchased from ABI. The sequences of the C14TKL-1 primers and probe were as follows:

Forward: 5′-ACGTGACTGGGTGAGCCAA-3′
Reverse: 5′-AGCAGCTGGAAATGTTTGCA-3′
Probe: 5′-TGGGCTCTCTTTCTAATTTGCATTTGGTGTTG-3′

The PCR reactions were prepared with the components from the Invitrogen Platinum Quantitative RT-PCR One-Step kit and were assembled according to the manufacturer’s instructions (Invitrogen, Carlsbad, Calif.). The final concentrations of the primers and probe in the PCR reactions were 200 nM and 100 nM, respectively. In addition, 0.25 µL of a passive reference dye (ROX, Invitrogen) was added to each reaction, and each 12.5 µL PCR reaction contained 2.5 µL (25 ng) of total RNA prepared as described above. The RT-PCR reactions were performed in a single 384-well plate according to the following protocol: one cycle for 30 min at 48°C, followed by one 20-min cycle at 95°C, followed by 40 cycles at 95°C for 15 s and 60°C for 1 min. A separate plate of the same RNAs was used to quantitate 18S RNA as an internal control.

C14TKL-1 activity studies: Human NK1 receptor cells were supplied by J. Krause (Washington University, St. Louis, Mo.). C14TKL-1 was custom synthesized (Research Genetics, Huntsville, Ala.). The peptide was resuspended in water at 1 mg/mL, aliquoted, and stored frozen prior to use. Receptor screening was accomplished with the fluorometric imaging plate reader (FLIPR, Molecular Devices, Sunnyvale, Calif.). Briefly, on the day of screening, CHO cells expressing tachykinin receptor were loaded in suspension for 1 h with Fluo-3AM (Sigma Chemical Corp., St. Louis, Mo.) in Hanks buffer containing 20 mM Hepes and 0.1% BSA. Cells were then washed once with the same buffer and re-plated into clear-bottom, black-walled, 96-well plates pre-coated with poly-D-lysine (Becton-Dickinson, Franklin Lakes, N.J.) at a density of 5 × 10⁵ cells/well. The plates were centrifuged briefly at low speed (300 g) and subsequently assayed in the FLIPR. Cells were exposed to varying concentrations of peptide, and the rise in intracellular calcium was measured. Data are expressed as the maximal response at each concentration of agonist.

Results

The PepPat program. In PepPat, the pattern-matching method is similar to the ones described by Knuth (Knuth et al. 1977) or Baeza-Yates and Gonnet (1992). Patterns are divided into rigid parts, which are the fixed amino acids at particular positions, and loose parts, where multiple amino acids are allowed at other positions. The rigid parts are searched first; then, local searches for loose pattern elements are performed. If no rigid parts are specified in the pattern originally, PepPat will treat this pattern as a combination of patterns with rigid parts. PepPat scans the rigid pattern with window size 1 from the N-terminus to the C-terminus of the database sequences. PepPat also allows a jumping mode, in which, when a pattern is identified, PepPat jumps to the end of the identified pattern and starts scanning again from there. The jumping mode is fast but ignores the possibility of nested pattern hits (more than one pattern hit overlapping in the subject sequence).
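The difference between the default window-by-one scan and the jumping mode can be sketched in Python. This is a minimal illustration under stated assumptions: it uses Python's `re` module in place of PepPat's rigid-part/loose-part search, and the function name `scan` and the toy sequence are hypothetical.

```python
import re

def scan(pattern, sequence, jump=False):
    """Collect (start, matched_text) for every position where the pattern matches.

    jump=False advances one residue at a time, so overlapping hits are kept.
    jump=True resumes after the end of each hit, so overlapping hits are skipped.
    """
    regex = re.compile(pattern)
    hits = []
    pos = 0
    while pos < len(sequence):
        match = regex.match(sequence, pos)  # anchored at the current position
        if match and match.end() > pos:
            hits.append((pos, match.group(0)))
            pos = match.end() if jump else pos + 1
        else:
            pos += 1
    return hits

toy = "ABABABA"  # made-up sequence with overlapping occurrences of "A.A"
print(scan("A.A", toy))             # three hits, at positions 0, 2, and 4
print(scan("A.A", toy, jump=True))  # two hits, at positions 0 and 4
```

The hit at position 2 is lost in jumping mode because it overlaps the first hit, which is the trade-off between speed and completeness described above.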

When a pattern hit is found, PepPat splices out the pattern-containing peptide and aligns it with the query peptide. This alignment is anchored by the matched pattern, and no gap is permitted. PepPat does not allow gaps because it is designed mainly for oligopeptide homology searches. This also simplifies the alignment procedure, but requires the length of the query pattern to be exactly the same as the query peptide. PepPat scores and orders database hits based upon alignment similarities calculated with a substitution-scoring matrix. We developed the program initially with the BLOSUM62 (the default matrix; Henikoff and Henikoff 1992) and PAM250 (Dayhoff 1978) scoring matrices, although one can specify and use other matrices. Three scores are employed to describe a pattern hit: Score, the sum of the amino acid pair scores Pi derived from the scoring matrix (Score = ΣPi); the identity score Identity, the percentage of M, the number of identical amino acid positions in the alignment, relative to N, the length of the query peptide (Identity = (M/N) × 100%); and the similarity score Similarity, the percentage of the Score of a hit, Sh, relative to the Score of a perfect hit, Sp (Similarity = (Sh/Sp) × 100%). Since the specified pattern itself is usually biologically significant, similarity scoring is important for distinguishing relevant hits.
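The three scores can be worked through for the tachykinin example used in this paper. The Python sketch below is an illustration only, with assumptions stated: the function names are hypothetical, the dictionary holds just the handful of BLOSUM62 entries needed for this pair of peptides, and the "perfect hit" is taken to be the query aligned with itself.

```python
# Subset of BLOSUM62 entries needed for the example below (not the full matrix).
BLOSUM62_SUBSET = {
    ("F", "F"): 6, ("V", "V"): 4, ("G", "G"): 6, ("L", "L"): 4,
    ("M", "M"): 5, ("K", "K"): 5, ("R", "R"): 5, ("Y", "Y"): 7,
    ("V", "Y"): -1, ("K", "R"): 2,
}

def pair_score(a, b, matrix):
    """Look up a symmetric substitution score for residues a and b."""
    return matrix[(a, b)] if (a, b) in matrix else matrix[(b, a)]

def score_hit(query, hit, matrix):
    """Return (Score, Identity %, Similarity %) for an ungapped alignment."""
    if len(query) != len(hit):
        raise ValueError("ungapped alignment needs equal lengths")
    score = sum(pair_score(q, h, matrix) for q, h in zip(query, hit))
    # Assumption: a perfect hit is the query aligned with itself.
    perfect = sum(pair_score(q, q, matrix) for q in query)
    identical = sum(q == h for q, h in zip(query, hit))
    identity = 100.0 * identical / len(query)
    similarity = 100.0 * score / perfect
    return score, identity, similarity

# Neurokinin A query region versus the C14TKL-1 hit reported in this paper.
score, identity, similarity = score_hit("FVGLMGKR", "FYGLMGKR", BLOSUM62_SUBSET)
print(score, identity, round(similarity, 2))
```

Under these assumptions the single V-to-Y substitution gives Score = 36 against a perfect score of 41, Identity = 87.5%, and Similarity of about 87.8%.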

The output of PepPat is similar to the HSP (high scoring pairs) alignments returned by BLAST (Altschul et al. 1990), with a vertical bar to indicate an identical amino acid, a plus sign to indicate that the two aligned amino acids have a positive score from the scoring matrix, and an empty space otherwise.
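The three-symbol convention can be sketched as follows. The exact layout of PepPat's output is not reproduced in the text, so the Python function below is a hypothetical rendering of the rule only; the two matrix entries are BLOSUM62 values, pairs missing from the small dictionary default to 0 as an assumption, and the hit peptide is made up so that a plus sign appears.

```python
# Only the non-identical pairs needed for the example (BLOSUM62 values).
PAIR_SCORES = {("V", "Y"): -1, ("K", "R"): 2}

def midline(query, hit, matrix):
    """Build the line shown between two aligned peptides.

    '|' marks identical residues, '+' marks a positive matrix score,
    and a space marks everything else.
    """
    marks = []
    for q, h in zip(query, hit):
        if q == h:
            marks.append("|")
        elif matrix.get((q, h), matrix.get((h, q), 0)) > 0:
            marks.append("+")
        else:
            marks.append(" ")
    return "".join(marks)

query = "FVGLMGKR"
hit = "FYGLMGRK"  # made-up hit chosen to show all three symbols
print(query)
print(midline(query, hit, PAIR_SCORES))
print(hit)
```

Here V/Y scores negatively and is left blank, while the K/R and R/K pairs score positively and are marked with plus signs.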

PepPat does not calculate the statistical significance of pattern hits for two reasons: first, if the query peptide is short, statistical calculations may not be meaningful (such as E being set at 100,000 for a BLAST search); second, PepPat assumes that the pattern specified has biological significance. A drawback is that too many hits may be produced. One can limit the number of hits by setting the similarity cutoff with the flag "-s" (the default is 3.33).

The application of PepPat in a pattern-based database search of a short peptide is compared with PHI-BLAST. As shown in Table 1, a 6mer peptide, MGLGEM, was artificially chosen; we first specified some conserved amino acids in this 6mer and ran PHI-BLAST (pattern "x(5)M"); no hits were identified, with the E value being set at 100,000. However, PepPat (pattern ".{5}M") readily identified many hits and ranked them according to similarity scores. It should be pointed out that PepPat allows the specification of the number of pattern re-occurrences. From these data, PepPat seems complementary to PHI-BLAST (Zhang et al. 1998) in dealing with a short query peptide as well as in specifying a repeated pattern.

Table 1. Comparison of PepPat with BLAST and PHI-BLAST. A 6mer peptide, MGLGEM, was chosen to run BLAST; then an artificial "conserved amino acid" was chosen as a pattern to run PHI-BLAST (pattern x(5)M) and PepPat (pattern .{5}M). The subject database is the Drosophila proteome. Owing to the shortness of the query, BLAST cannot identify any hits even when the E value is at 100,000, and BLAST will not allow a customized pattern either. Then PHI-BLAST was run; since PHI-BLAST calculates similarity only of the pattern-flanking regions, again no hits were identified even when the E value was set at 100,000. However, with PepPat, hits are readily identified and ranked based on similarity scoring (not shown).


Application of PepPat in mining peptide ligands. To date, most peptide ligands are identified by laborious biochemical or molecular biology approaches. Since most peptide ligands are short, it has been difficult to perform systematic mining of peptide ligands through computational methods. Regular sequence alignment algorithms can be applied when one assumes that, besides the short functional peptides, the whole precursor proteins are significantly similar, which is not always the case. Also, translation of fragmented or error-containing sequence data (e.g., ESTs) sometimes will not be able to produce long or accurate open reading frames. PepPat is particularly useful in this regard.

With prior knowledge, one can develop a collection of query functional peptide patterns and corresponding example peptides of interest; in our case, we developed a collection of patterns with associated peptides representing ligands (proprietary data), then applied PepPat to perform database searches. Here we give two examples of applying PepPat in peptide ligand mining and report the identification of a novel tachykinin-like peptide; this is further validated by expression and activity studies.

Mining RF amide-related peptides. The peptide FMRFGR is the precursor sequence of insect FMRF amide peptides (Price and Greenberg 1977); FMRF amide-related peptides (FaRPs) have been found in both invertebrates and vertebrates with neuroregulatory functions (Greenberg and Price 1992; Raffa 1991). The sequence diversity of FMRF amide peptides has been recently described (Espinoza et al. 2000). Besides containing RF amide peptides, the precursor proteins usually do not share significant homology (an initial try with known algorithms failed). Both FMRF amide and FaRPs have a conserved amide group, which is the product of post-translational processing at a C-terminal GK or GR site. Therefore, a pattern for FMRF amides and FaRPs can be specified as "..RFG[KR]". Many FMRF amide and FaRP precursors are also repetitive. We therefore required that the pattern appear at least twice in subject hits. Known FMRF amides and FaRPs were identified as top hits in a search of the NCBI NR database with this pattern, as was a recently described novel human FaRP precursor containing three repeated RF amides (Hinuma et al. 2000). We also identified a novel human FaRP by searching a proprietary database (Ying et al. patent pending). Table 2 is a summary of the search results obtained by using the RF amide functional pattern to search the Drosophila melanogaster proteome. When we searched the Drosophila proteome, the known FMRF amide precursor was identified as the best hit. The second hit is the known Drosophila FaRP precursor protein. The third hit also has very good homology to the query FMRF amide and may represent a novel FaRP.
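The repeat requirement can be illustrated with a short Python sketch. It is not PepPat's implementation of the -r option; the function names are hypothetical, overlapping occurrences are counted through a regex lookahead, and the precursor fragments are made up for the example.

```python
import re

def count_pattern_hits(pattern, sequence):
    """Count every start position where the pattern matches, overlaps included."""
    return sum(1 for _ in re.finditer(f"(?=(?:{pattern}))", sequence))

def has_min_repeats(pattern, sequence, min_rep=2):
    """True when the pattern occurs at least min_rep times in the sequence."""
    return count_pattern_hits(pattern, sequence) >= min_rep

RF_AMIDE = "..RFG[KR]"

# Made-up fragments: one with two RF amide sites, one with a single site.
repetitive = "SDNFMRFGKSAPQDFVRFGRDPKQ"
single = "AAFMRFGRAA"

print(has_min_repeats(RF_AMIDE, repetitive))  # two sites, passes a minimum of 2
print(has_min_repeats(RF_AMIDE, single))      # one site, fails a minimum of 2
```

Requiring two or more occurrences removes sequences that contain the short motif once by chance, which is how the repeat count raises specificity for repetitive precursors.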

Table 2. PepPat search of the Drosophila melanogaster proteome with the specified pattern "..RFG[KR]" required to be present at least two times in subject hits. The search was conducted by using a Compaq DEC AlphaServer 8400 5/625 with 4 Gbyte memory and required 5 CPU seconds on a single 21164A EV5.6 613 MHz processor. The Drosophila proteome contains 13,609 sequences (Adams et al. 2000). The first hit (gi42102506) is the Drosophila FMRFamide precursor (Schneider et al. 1988). The second hit (gi42096089) is the Drosophila FaRP Drosulfakinin (Nichols and Taghert 1988). The third hit (gi42097717) is CG13968, a predicted cDNA of unknown function that may represent a novel RF amide peptide. Only the top three hits are shown. Syntax: /pp -r 2 -v FMRFGR "..RFG[KR]" data/fly.fasta > pp.result (Script executable/Number of Repeats specified by flag -r/flag -v verbose mode/Query Peptide/Pattern/Sequence Database File/redirect Output File).


The identification of a novel tachykinin-like peptide, C14TKL-1, as NK1 agonist. Tachykinins are a family of peptide neurotransmitters and have an important function in a variety of cellular pathways. The sequences of tachykinins share a common C-terminal amino acid sequence F-X-GLM-NH2 (reviewed by Watson and Arkinstall 1994). The conserved C-terminal amide group is the post-translational modification product from G[KR][KR]. This makes it a good application of PepPat.

To perform the PepPat search, a relaxed functional pattern was designed for tachykinin peptides, and a corresponding tachykinin, neurokinin A, was used as the query peptide (see Materials and methods). We searched a translation of the Incyte Lifeseq EST database and identified a high-scoring hit as one of the top hits, FYGLMGKR, in Incyte EST 2889669H1, while BLAST and PHI-BLAST failed owing to the shortness of the query. Because the EST mapped to Chr 14, we designated it Chr 14 tachykinin-like peptide 1 (C14TKL-1). On the basis of a possible peptidase processing site, we extended the sequence N-terminally to infer a peptide sequence for C14TKL-1 as: RHRTPMFYGLM-NH2.

Fig. 1a. Expression study of C14TKL-1 by quantitative PCR. C14TKL-1 is expressed in a wide range of tissues and brain regions. The expression of C14TKL-1 in various human tissues (three donors/tissue) was examined by quantitative PCR with primer pairs and probes specific to the predicted C14TKL-1 mRNA. Results are expressed as fg of transcript/25 ng of input cDNA as determined in comparison with a plasmid standard and are displayed on a radial graph where each spoke is the expression for the indicated tissue or human cell line. The distance from the center increases with the relative amount of transcript present, and the results have been normalized to the amount of rRNA present in the RNA samples. Expression was considered significant if above the indicated Ct35 > 9.9 × 10⁻⁶ fg transcript/25 ng input cDNA.

In order to determine whether the C14TKL-1 message is transcribed and to assess its tissue distribution, we performed quantitative PCR on a variety of human cDNA libraries with primers and a probe specific for the C14TKL-1 EST. Initially, a 282-bp PCR fragment was obtained from a universal cDNA library and was used as a control and a standard for quantitation. A message for C14TKL-1 was detected in most tissues, with higher levels present in peripheral organs, particularly in liver (Fig. 1a).

The previously identified tachykinins (Substance P, neurokinin A, and neurokinin B) are ligands for three G protein-coupled receptors, NK1, NK2, and NK3 (Vanden Broeck et al. 1999). We therefore examined the ability of C14TKL-1 to activate CHO cells transfected with human NK1. A dose-dependent rise in intracellular calcium was observed in transfected cells, but not in parental CHO cells (Fig. 1b) or in cells transfected with unrelated GPCRs (data not shown). Together, these data demonstrate that the gene for C14TKL-1 is in fact transcribed and that the predicted peptide has biological activity. The potency of C14TKL-1 is comparable to that of NK1’s known ligand, substance P, with EC50s of 1 and 0.8 nM, respectively. Further studies will be necessary to fully characterize this peptide and its precursor.

Discussion

We report here, first, a pattern-based oligopeptide homology search program. Unlike most known pattern-matching algorithms (Mehldau and Myers 1993; Pesole et al. 2000), PepPat combines similarity scoring with pattern matching. PepPat requires the presence of a query example peptide besides a query pattern. PepPat can thus rank pattern hits according to their similarity to the query peptide, making it easy to identify the relevant pattern hits. PepPat calculates similarity within the pattern-spanning region and performs non-gapped alignment; thus, it is more suited for short peptide pattern homology searches. As shown in this report, PepPat readily performs a database search of a 6mer peptide pattern. In theory, a query peptide of any word size is allowed. Although gapped alignment may not be biologically desirable for short functional peptides, for longer query oligopeptides or proteins an algorithm performing gapped alignment, such as PHI-BLAST, is needed. PepPat should be viewed as complementary to PHI-BLAST for dealing with short pattern-based peptide homology searches, especially when the subject database contains fragmented proteins, such as those found in EST translations and virtual predicted exons.

Fig. 1b. C14TKL-1 activity study. C14TKL-1 is a tachykinin receptor agonist. CHO cells expressing human NK1 were examined for their ability to release intracellular calcium in response to C14TKL-1 (solid line) and substance P (dashed line, control). Data shown represent one of three experiments and are plotted as peak fluorescence (counts) versus log concentration (nM).


PepPat’s flexibility is demonstrated in short peptide ligand mining: for RF amide-related peptides, PepPat searches with a 6mer peptide pattern identified suitable candidates. PepPat also allowed the specification of the recurrence of a particular pattern, which increased the specificity of this search. As a validation of this method, PepPat identified a novel mammalian tachykinin-like peptide from a fragmented protein database of EST translations, whereas other existing methods failed. PepPat also identified novel RF amide-related peptides (Jiang et al. unpublished; patent pending).
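The recurrence filter mentioned above can be illustrated with a short sketch that keeps only sequences in which a pattern occurs at least a given number of times. The pattern `RFG[KR]` and the sequences are hypothetical examples, not the pattern or data used in the study, and the function names are assumptions.

```python
import re


def count_occurrences(pattern, sequence):
    """Count non-overlapping matches of a regular expression in a sequence."""
    return sum(1 for _ in re.finditer(pattern, sequence))


def filter_by_recurrence(pattern, database, min_count):
    """Return {seq_id: count} for sequences with at least min_count matches."""
    kept = {}
    for seq_id, seq in database.items():
        count = count_occurrences(pattern, seq)
        if count >= min_count:
            kept[seq_id] = count
    return kept


if __name__ == "__main__":
    # Hypothetical precursor-like sequences, invented for illustration.
    db = {
        "precursor_a": "AAPQRFGRSSLPLRFGKTT",
        "single_hit": "MMPQRFGRAAA",
        "no_hit": "ACDEFGHIK",
    }
    print(filter_by_recurrence(r"RFG[KR]", db, min_count=2))
```

Requiring two or more occurrences discards sequences that match a short motif only once, which is one plausible way a recurrence requirement raises specificity for precursors that encode several related peptides.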

The novel tachykinin-like peptide discovered by PepPat is from an EST sequence. Translation of this EST gives rise to an ORF that contains the identified peptide. The expression study, which confirmed the existence of this message, shows that expression is highest in peripheral tissues, especially liver. The agonist activity of this peptide is on a par with that of the known NK1 ligand substance P (Vanden Broeck et al. 1999). Recently, a novel NK1 agonist, hemokinin, has been described (Zhang et al. 2000; Morteau et al. 2001; Bellucci et al. 2002; Camarda et al. 2002). The relative biological roles of C14TKL-1, hemokinin, and other tachykinin peptides in neurokinin receptor biology will require further study. Whether additional receptors exist remains an open question.

Acknowledgments

The authors thank Professor Gu Xiaochen of Peking University, China for helpful discussions.

References

1. Adams MD, Celniker SE, Holt RA, Evans CA, Gocayne JD et al. (2000) The genome sequence of Drosophila melanogaster. Science 287, 2185–2195

2. Altschul SF, Gish W et al. (1990) Basic local alignment search tool. J Mol Biol 215, 403–410

3. Altschul SF, Madden TL et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25, 3389–3402

4. Apweiler R, Attwood TK et al. (2001) The InterPro database, an integrated documentation resource for protein families, domains and functional sites. Nucleic Acids Res 29, 37–40

5. Attwood TK, Blythe MJ et al. (2002) PRINTS and PRINTS-S shed light on protein ancestry. Nucleic Acids Res 30, 239–241

6. Baeza-Yates R, Gonnet GH (1992) A new approach to text searching. Commun Assoc Comp Mach 35(10), 74–82

7. Bateman A, Birney E et al. (2002) The Pfam protein families database. Nucleic Acids Res 30, 276–280

8. Bellucci F, Carini F et al. (2002) Pharmacological profile of the novel mammalian tachykinin, hemokinin 1. Br J Pharmacol 135, 266–274

9. Camarda V, Rizzi A et al. (2002) Pharmacological profile of hemokinin 1: a novel member of the tachykinin family. Life Sci 71, 363–370

10. Corpet F, Servant F et al. (2000) ProDom and ProDom-CG: tools for protein domain analysis and whole genome comparison. Nucleic Acids Res

11. Dayhoff MO (1978) Survey of new data and computer methods of analysis. In: Atlas of protein sequence and structure, vol. 5, suppl. 3. (Georgetown University, Washington, D.C.: National Biomedical Research Foundation)

12. Durbin R, Eddy S et al. (1998) Biological sequence analysis: probabilistic models of proteins and nucleic acids. (Cambridge, UK: Cambridge University Press). ISBN 0521620414

13. Espinoza E, Carrigan M et al. (2000) A statistical view of FMRFamide neuropeptide diversity. Mol Neurobiol 21, 35–56

14. Falquet L, Pagni M et al. (2002) The PROSITE database, its status in 2002. Nucleic Acids Res 27, 215–219

15. Gracy J, Argos P (1998) DOMO: a new database of aligned protein domains. Trends

16. Greenberg MJ, Price DA (1992) Relationships among the FMRFamide-like peptides. Prog Brain Res 92, 25–27

17. Henikoff JG, Greene EA (2000) Increased coverage of protein families with the blocks database servers. Nucleic Acids Res 28, 228–230

18. Henikoff S, Henikoff JG (1992) Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci USA 89, 10915–10919

19. Hinuma S, Shintani Y et al. (2000) New neuropeptides containing carboxy-terminal RFamide and their receptor in mammals. Nat Cell Biol 2, 703–708

20. Huang JY, Brutlag DL (2001) The EMOTIF database. Nucleic Acids Res 29, 202–204

21. Knuth DE, Morris JH Jr, Pratt VR (1977) Fast pattern matching in strings. SIAM J Comput 6, 323–350

22. Letunic I, Goodstadt L et al. (2002) Recent improvements to the SMART domain-based sequence annotation resource. Nucleic Acids Res 30, 242–244

23. Marchler-Bauer A, Panchenko AR et al. (2002) CDD: a database of conserved domain alignments with links to domain three-dimensional structures. Nucleic Acids Res 30, 281–283

24. Mehldau G, Myers G (1993) A system for pattern matching applications on biosequences. Comput Appl Biosci 9, 299–314

25. Morteau O, Lu B et al. (2001) Hemokinin is a full agonist at the substance P receptor. Nat Immunol 2, 1088

26. Mount DW (2001) Bioinformatics: sequence and genome analysis. (Cold Spring Harbor, N.Y.: Cold Spring Harbor Laboratory Press). ISBN 0879695978

27. Nichols R, Schneuwly SA, Dixon JE (1988) Identification and characterization of a Drosophila homologue to the vertebrate neuropeptide cholecystokinin. J Biol Chem 263, 12167–12170

28. Pearson WR (2000) Flexible sequence similarity searching with the FASTA3 program package. Methods Mol Biol 132, 185–219

29. Pearson WR, Lipman DJ (1988) Improved tools for biological sequence comparison. Proc Natl Acad Sci USA 85, 2444–2448

30. Pesole G, Liuni S, D’Souza (2000) PatSearch: a pattern matcher software that finds functional elements in nucleotide and protein sequences and assesses their statistical significance. Bioinformatics 16, 439–450

31. Price DA, Greenberg MJ (1977) Structure of a molluscan cardioexcitatory neuropeptide. Science 197, 670–672

32. Raffa RB (1991) The actions of FMRF-NH2 and FMRF-NH2 related peptides on mammals. NIDA Res Monogr 105, 243–249

33. Schaffer AA, Wolf YI et al. (1999) IMPALA: matching a protein sequence against a collection of PSI-BLAST-constructed position-specific score matrices. Bioinformatics 15, 1000–1011

34. Schneider LR, Taghert PH (1988) Isolation and characterization of a Drosophila gene that encodes multiple neuropeptides related to Phe-Met-Arg-Phe-NH2 (FMRFamide). Proc Natl Acad Sci USA 85, 1193–1197

35. Vanden Broeck J, Torfs H, Poels J, Van Poyer W, Swinnen E et al. (1999) Tachykinin-like peptides and their receptors. A review. Ann NY Acad Sci 897, 374–387

36. Wall L, Christiansen T, Schwartz RL (1996) Programming Perl, 2nd edn. O’Reilly and Associates, Sebastopol, CA.

37. Watson S, Arkinstall S (1994) The G-protein linked receptors. (New York: Academic Press), pp 261–271

38. Wheeler DL, Church DM et al. (2002) Database resources of the National Center for Biotechnology Information: 2002 update. Nucleic Acids Res 30, 13–16

39. Zhang Y, Lu L et al. (2000) Hemokinin is a hematopoietic-specific tachykinin that regulates B lymphopoiesis. Nat Immunol 1, 392–397

40. Zhang Z, Schaffer A et al. (1998) Protein sequence similarity searches using patterns as seeds. Nucleic Acids Res 26, 3986–3990
