Miniprofiles: A New Tool to Evaluate PROSITE Pattern Matches


by Beatrice Cuche

Supervisors: Dr. Nicolas Hulo and Dr. Christian J.A. Sigrist

Master in Proteomics and Bioinformatics

University of Geneva

5th July 2007


Abstract

The major disadvantage of patterns is that they do not produce a score associated with their matches. In this work we have developed a new tool to evaluate PROSITE pattern matches. This tool consists of the automatic creation of generalised profiles that we call miniprofiles. Each miniprofile can be used in conjunction with the pattern it was derived from in order to assess pattern matches in TrEMBL as either True Positive, if the miniprofile detects the match, or Unknown, if it does not. Miniprofiles for 1271 PROSITE patterns were generated through our automatic procedure, among which we believe that at least 1230 can be used to assess pattern matches in TrEMBL.


Contents

1 Introduction
    1.1 PROSITE
    1.2 PROSITE Patterns
    1.3 PROSITE Profiles
    1.4 Creation of PROSITE profiles
    1.5 Calibration of PROSITE profiles
    1.6 Multiple Sequence Alignment Tools
    1.7 Aim of the Project

2 Materials and Methods
    2.1 Automatic Miniprofile Construction-Default Method
        2.1.1 Extraction
        2.1.2 Seed Alignment
        2.1.3 Multiple Sequence Alignment (MSA)
        2.1.4 Weighting the Alignment
        2.1.5 Making the Matrix
        2.1.6 Calibrating the Matrix
        2.1.7 Changing the Cut-off at Level 0
    2.2 Assessing the Quality of Miniprofiles
    2.3 Automatic Profile Construction-Other Methods
    2.4 Automatic Profile Construction-All Methods
    2.5 Editing of the Profile
    2.6 Queries on Incmatch

3 Results
    3.1 Miniprofiles inserted into Incmatch
        3.1.1 all-TP Miniprofiles inserted into Incmatch
        3.1.2 not-all-TP Miniprofiles inserted into Incmatch
        3.1.3 Patterns having at least one profile in the same PROSITE documentation
        3.1.4 not-all-TP and Problematic Miniprofiles that were not inserted into Incmatch
        3.1.5 Statistics about the 1309 patterns and the 1271 miniprofiles included in Incmatch
    3.2 Comparing T-Coffee, ProbCons and ClustalW
    3.3 Incmatch
        3.3.1 Testing Miniprofiles in Swiss-Prot
        3.3.2 TrEMBL

4 Discussion

5 Appendix

6 Acknowledgments


1 Introduction

1.1 PROSITE

PROSITE is a database of protein families and domains that was started in 1988 [1]. There are two kinds of motif descriptors used in PROSITE: patterns and profiles. Both patterns and profiles are derived from multiple sequence alignments (MSAs) of homologous sequences. MSAs make it possible to recognize distant relationships between sequences that would go unnoticed if only pairwise alignments were used [2].

The first profiles were created and added to PROSITE in 1993 [3] and by 1995 the database contained 6 profiles [4].

Patterns and profiles have complementary strengths in terms of the regions they are best applied to. On the one hand, patterns are powerful predictors of protein function for small regions with high sequence similarity, such as enzyme catalytic sites; on the other hand, profiles are more appropriate for less conserved regions, particularly when they cover entire domains, which is often the case [2].

The PROSITE database is composed of two main text files that form the core of the database: prosite.dat and prosite.doc. Both files are updated weekly and are available for download at ftp://ftp.expasy.org/databases/prosite/.

• prosite.dat is a computer-readable file that gives all the information necessary for programs to scan for the occurrence of patterns or profiles in protein sequences. It contains, for each entry, both statistics on the number of hits obtained when scanning the Swiss-Prot section of the UniProt Knowledgebase (hereafter Swiss-Prot) [5] for a given pattern or profile, and cross-references to Swiss-Prot and to the PDB 3D structure database [6].

• prosite.doc incorporates complete textual documentation regarding each pattern or profile.

The latest version of PROSITE (at the time of writing: release 20.15 of June 26, 2007) contains 1319 pattern and 737 profile entries.

PROSITE has also been part of InterPro since its creation in 1999 [7]. InterPro is an integrated documentation resource for protein families, domains and functional sites. It now contains 10 protein signature databases [8].

1.2 PROSITE Patterns

Patterns are regular expressions matching short sequence motifs, usually of biological significance.

A pattern is a particular cluster of residue types, often also called a motif or signature. A pattern is usually 10 to 20 amino acids long. Such residue clusters make it possible to establish relationships between sequences that are too distantly related for the relationship to be revealed by pairwise sequence alignments. Patterns originate from the fact that specific residues or regions that are significant for the biological function of a group of proteins are evolutionarily conserved in both structure and sequence. Such biologically important clusters of residues or regions include, for example, amino acids involved in binding metal ions and cysteines involved in disulphide bonds. A pattern or regular expression is a consensus expression that is designed from a multiple alignment of biologically meaningful motifs. It is possible to reduce such motifs to regular expressions precisely because they are conserved throughout evolution [2]. Here is an example of a pattern, PS00104, an EPSP synthase (3-phosphoshikimate 1-carboxyvinyltransferase) signature:

[LIVF] - {LV} - x - [GANQK] - [NLG] - [SA] - [GA] - [TAI] - [STAGV] - N - R - x - [LIVMFYAT] - x - [GSTAP]

The amino acids are named by their one-letter code.

• [LIVF]: indicates that, at this position, any of the 4 amino acids between the square brackets is acceptable. Usually, when a choice is available at a given position, as is the case here, the possible amino acids have similar physicochemical properties; in this case the 4 amino acids are hydrophobic.

• {LV}: indicates that any amino acid can be found at this position except L and V, the 2 amino acids between the braces. Again, if more than one amino acid is forbidden at a certain position, these amino acids most often have similar properties; here V and L are both hydrophobic and the geometry of their respective side chains is very similar.

• x: means that any amino acid is compatible with this position.

• R: means that R and only R can occupy this position; any other residue will cause the sequence to be a mismatch.

The 3-phosphoshikimate 1-carboxyvinyltransferase (P17688) from the chloroplast of Brassica napus, for example, is a protein match for PS00104 because it contains the following subsequence:

L-Y-L-G-N-A-G-T-A-M-R-P-L-T-A
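The four element types described above map directly onto ordinary regular-expression syntax. The following Python sketch illustrates that mapping; the function name `prosite_to_regex` is hypothetical, and only the elements explained in this section are handled (the complete PROSITE syntax has further constructs, such as repetitions and anchors, that are ignored here). It is applied to the first four positions of PS00104 and to the first four residues of the subsequence quoted above.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate the subset of PROSITE syntax described above into a Python regex."""
    parts = []
    for element in pattern.replace(" ", "").split("-"):
        if element == "x":                 # any amino acid
            parts.append(".")
        elif element.startswith("{"):      # forbidden residues
            parts.append("[^" + element[1:-1] + "]")
        else:                              # [ABC] choice or a single fixed residue
            parts.append(element)
    return "".join(parts)

# First four positions of PS00104, checked against L-Y-L-G.
regex = prosite_to_regex("[LIVF] - {LV} - x - [GANQK]")
print(regex)                               # [LIVF][^LV].[GANQK]
print(bool(re.fullmatch(regex, "LYLG")))   # True
```

The result is purely qualitative: `re.fullmatch` either succeeds or fails, which is the limitation of patterns discussed in the rest of this section.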

The complete syntax of a PROSITE pattern is available at http://www.expasy.org/tools/prosite/prosuser.html.

A regular expression is intrinsically qualitative, whatever the field considered: either a sequence is a match for a given pattern or it is not, and there is no such thing as a mildly good pattern match. Because patterns do not produce scores, the quality of an individual match cannot be evaluated. Nevertheless, the accuracy of a pattern can be judged by computing 2 parameters: its precision and its recall.

\[ \mathrm{Precision} = \frac{TP}{TP + FP} \tag{1} \]

\[ \mathrm{Recall} = \frac{TP}{TP + FN} \tag{2} \]

where TP are the known true positives or true hits; FP, the known false positives or false hits; and FN, the known false negatives or missed hits. These statistics (TP, FP, FN) are obtained through manual annotation of the Swiss-Prot database.

Families or domains are sometimes defined by the occurrence of one high-specificity pattern, or by the co-occurrence of several lower-specificity patterns.

The two main advantages of patterns are that they are readily intelligible to any user and that they enable the detection of the most conserved residues, which often happen to be the most relevant in terms of biological function. Additionally, scanning a protein database with a pattern is not too time-consuming [2].
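As a small worked illustration of equations (1) and (2), the sketch below computes both parameters in Python. The function names are hypothetical; the counts are the sequence counts reported for pattern PS00053 in section 2.1.1 (357 TP, 1 FP, 20 FN) and are used here only as an example input.

```python
def precision(tp: int, fp: int) -> float:
    """Equation (1): fraction of reported hits that are true hits."""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """Equation (2): fraction of true family members that are hit."""
    return tp / (tp + fn)

# Counts reported for PS00053 in section 2.1.1: 357 TP, 1 FP, 20 FN.
print(round(precision(357, 1), 4))  # 0.9972
print(round(recall(357, 20), 4))    # 0.9469
```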

1.3 PROSITE Profiles

As discussed above, patterns present intrinsic limitations that prevent them from identifying distant homologues. The globin domain is an example of a domain that cannot be identified by patterns: it contains only a few very well conserved positions.

To tackle this problem, i.e. the identification of such poorly conserved domains or families, generalised profiles have been developed. Generalised profiles (or weight matrices) are able to detect more distant homologues because of their increased discriminatory power. In contrast to patterns, profiles are quantitative motif descriptors: they assign a degree of similarity to potential matches rather than a status of true or false. A numerical weight is assigned to each possible match or mismatch between a sequence residue and a profile position, then the sum of the weights over all positions is calculated and compared to a threshold.

The enhanced sensitivity of profiles also results from the sophistication of the profile construction method. In addition, whereas patterns only cover the most conserved parts of domains, profiles usually cover domains over their entire length [2].

A profile is a linear structure composed of alternating match, insertion and deletion positions.

• A match position is usually occupied by a single amino acid. It assigns a weight for each possible residue type at this position as well as a deletion extension penalty.

• An insertion position assigns a weight for insertion. It also provides a parameter for opening/closing of a deletion gap as well as for initiation/termination of a partial alignment to the profile.


• A deletion position assigns a weight for deletion. [2]
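To make the idea of summing position-specific weights and comparing the sum to a threshold concrete, here is a deliberately simplified Python sketch. It handles match positions only, with no insertions or deletions, so it is not an implementation of generalised profiles; the profile, the default weight and the threshold are all invented for illustration.

```python
# Hypothetical three-position profile: one weight table per match position.
# The weights and the threshold are invented for illustration only.
TOY_PROFILE = [
    {"L": 5, "I": 4, "V": 3},
    {"G": 6, "A": 2},
    {"R": 7},
]
DEFAULT_WEIGHT = -2   # assumed weight for residues not listed at a position
THRESHOLD = 10        # assumed cut-off

def ungapped_score(segment: str) -> int:
    """Sum the per-position weights for a segment as long as the profile."""
    if len(segment) != len(TOY_PROFILE):
        raise ValueError("segment and profile must have the same length")
    return sum(col.get(res, DEFAULT_WEIGHT) for col, res in zip(TOY_PROFILE, segment))

for segment in ("LGR", "VAR", "LKK"):
    score = ungapped_score(segment)
    print(segment, score, "match" if score >= THRESHOLD else "no match")
```

Unlike a pattern, the profile ranks "LGR" (18) above "VAR" (12), although both pass the threshold.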

1.4 Creation of PROSITE profiles

Most profiles present in PROSITE were generated by the programs implemented in the pftools package. This package can be downloaded at http://www.isrec.isb-sib.ch/ftp-server/pftools/.

Here are the steps involved in the construction of profiles:

1. Multiple alignment of domains extracted from full sequences. This step is crucial as the quality of the alignment usually has a considerable impact on the quality of the profile created at the end of the process. There are several MSA programs, among which T-Coffee [9] is commonly used.

2. Attribution of weights to individual sequences of the MSA. This is achieved with pfw, which computes Voronoi weights [10], or with PrositeWeight [11], which is based on the GSC algorithm. The goal of this step is to prevent sequences in the multiple alignment that belong to the same subfamily from being over-represented in the making of the matrix.

3. Profile Construction. This is the conversion of the weighted alignment into an 'unscaled profile'. It involves two sub-steps, both implemented in pfmake:

a. Generation of a frequency profile whose data structure is equivalent to a hidden Markov model (HMM). This is why generalised profiles and profile HMMs are so readily inter-convertible. In this step each column of the MSA is mapped to a match or an insert position of the profile, depending on the number of gaps the column contains.

b. Conversion of the frequency profile into a scoring profile, i.e. transformation of the amino acid frequencies into weights. This is done using Gribskov's original formula [12], which gives a match weight for each residue according to the scores of the substitution matrix used. The resulting numbers in the profile are the weighted averages of the substitution scores of the residue in the query sequence against the residues observed in the corresponding column of the multiple alignment. The two commonly used families of scoring matrices are PAM and BLOSUM [13]; among them, BLOSUM45 is the most often used because it has been reported to be the best one in terms of sensitivity and selectivity of the produced profiles [14]. The transition scores, which are the weights applied to insertions and deletions, are also calculated by a formula derived from Gribskov's method [12].
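The weighted-average idea of step 3b can be sketched as follows. This is a minimal illustration, not the pfmake implementation: it assumes that the match weight of a residue a in a column is the sum, over the residues b observed in that column, of the frequency of b times the substitution score S(a, b). The three-letter alphabet and the substitution scores are invented and merely stand in for a real matrix such as BLOSUM45.

```python
# Toy symmetric substitution scores over a three-letter alphabet (invented values).
SUBST = {
    ("A", "A"): 5, ("A", "G"): 0, ("A", "L"): -1,
    ("G", "G"): 7, ("G", "L"): -3,
    ("L", "L"): 5,
}

def subst(a: str, b: str) -> int:
    """Look up S(a, b) in the symmetric toy matrix."""
    return SUBST[(a, b)] if (a, b) in SUBST else SUBST[(b, a)]

def match_weights(column_freqs: dict) -> dict:
    """Weight of residue a = sum over observed residues b of freq(b) * S(a, b)."""
    alphabet = "AGL"
    return {a: sum(f * subst(a, b) for b, f in column_freqs.items()) for a in alphabet}

# A column in which, after sequence weighting, A has frequency 0.75 and G 0.25.
print(match_weights({"A": 0.75, "G": 0.25}))
# {'A': 3.75, 'G': 1.75, 'L': -1.5}
```

Note that L receives a weight even though it never occurs in the column; this is how a profile can score residues that were not observed in the alignment.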

At this stage, the profile is constructed but it still needs to be calibrated to make the scores interpretable.


1.5 Calibration of PROSITE profiles

Calibration is the transformation of an 'unscaled' profile into a 'scaled' one. The 'unscaled' profile assigns a raw score to a potential match. In order to be relevant in terms of biological interpretation, the raw score must be converted into a normalized score. A normalized score is defined as the base 10 logarithm of the size (in residues) of the database in which one false positive match is expected to occur by chance [15].

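\[ N_{\mathrm{score}} = \log_{10}(\mathrm{DB}_{\mathrm{size}}) - \log_{10}(E_{\mathrm{value}}) \tag{3} \]

\[ E_{\mathrm{value}} = \mathrm{DB}_{\mathrm{size}} \times 10^{-N_{\mathrm{score}}} \tag{4} \]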

where DB size is the database size in residues. For Swiss-Prot, the DB size is 99940143 residues (release 53.2 of 26-Jun-07).

For instance, a match with a normalized score of 8.00 or higher is expected to occur about once in Swiss-Prot by chance, a normalized score of 8.5 corresponds to an E-value of 0.32, and an E-value of 0.01 corresponds to a normalized score of 10.00.

For calibrating profiles, PROSITE uses an empirical method:

1. Collecting high-scoring matches by scanning a randomised database such as window20 or reversed [2].

2. Estimating the parameters of the normalisation function by computing the cumulative distribution of the 2000 highest scores and then fitting it to an extreme value distribution.

Each PROSITE profile usually contains two cut-off levels:

1. The default level 0, which has its cut-off value set by default to a normalized score of 8.5. Level 0 enables the classification of matches as true positives and negatives.

2. A low cut-off level, called level -1, which is defined for weak matches; this level is useful in gene discovery, for example [2].
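The relation between normalized scores and E-values in equations (3) and (4) can be checked numerically. The Python sketch below uses hypothetical function names and the Swiss-Prot size quoted in this section; it reproduces the three example values given in the text, including the E-value of about 0.32 at the default level 0 cut-off of 8.5.

```python
import math

SWISS_PROT_RESIDUES = 99_940_143  # Swiss-Prot release 53.2, as quoted in this section

def e_value(n_score: float, db_size: int = SWISS_PROT_RESIDUES) -> float:
    """Equation (4): expected number of chance matches in the database."""
    return db_size * 10 ** (-n_score)

def n_score(e_val: float, db_size: int = SWISS_PROT_RESIDUES) -> float:
    """Equation (3): normalized score corresponding to a given E-value."""
    return math.log10(db_size) - math.log10(e_val)

print(round(e_value(8.0), 2))   # 1.0
print(round(e_value(8.5), 2))   # 0.32
print(round(n_score(0.01), 2))  # 10.0
```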

1.6 Multiple Sequence Alignment Tools

There are many alignment programs, among which ClustalW is one of the oldest. This program aligns protein sequences in three main steps. First, it performs pairwise alignments of all sequences against all others to create a library of global alignments; then it builds a phylogenetic tree using this library; finally, it uses the phylogenetic tree to carry out the multiple alignment [16].

T-Coffee (Tree-based Consistency Objective Function For alignment Evaluation) also uses a progressive strategy for aligning sequences. However, in the first step, two libraries are created from two kinds of alignments: global alignments from ClustalW and local alignments from Lalign. These two libraries are then combined into a single one, before a Neighbor Joining tree is built to perform the multiple alignment [17].

ProbCons (Probabilistic Consistency-based multiple sequence alignment) is the only MSA program known to use a probabilistic consistency method of alignment. Its procedure is divided into four steps. First, posterior-probability matrices are computed for every pair of sequences. Second, the expected accuracy of every pairwise alignment is calculated by dynamic programming. Third, the expected accuracy is used to calculate a guide tree. In the fourth and final step the guide tree is used to align the sequences progressively [18].

1.7 Aim of the Project

The major problem of patterns is that they do not produce scores; therefore the quality of a match cannot be evaluated. The aim of this project is to be able to automatically assign either a TP or an Unknown status to PROSITE pattern matches for proteins of the TrEMBL section of the UniProt Knowledgebase (hereafter TrEMBL) [5].

To achieve this, one miniprofile is created for each pattern using pattern TP matches collected in Swiss-Prot. If the miniprofile detects pattern FP matches in Swiss-Prot, it is considered inadequate for use in TrEMBL; otherwise the miniprofile is run on each sequence matched by the corresponding pattern in TrEMBL. If the miniprofile detects the same sequence, the sequence is given a TP status; if not, the status remains Unknown.
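The decision rule described above can be summarised in a few lines of Python. This is only a sketch of the logic: the function names are hypothetical, the accession numbers are made up, and the actual detection of matches (running the miniprofile on the sequences) is represented by precomputed booleans.

```python
def miniprofile_usable(fp_hits_in_swissprot: int) -> bool:
    """A miniprofile that detects any Swiss-Prot FP match of its pattern is rejected."""
    return fp_hits_in_swissprot == 0

def trembl_status(detected_by_miniprofile: bool) -> str:
    """Status given to a TrEMBL pattern match by a usable miniprofile."""
    return "TP" if detected_by_miniprofile else "Unknown"

# Hypothetical TrEMBL matches of one pattern: accession -> detected by the miniprofile?
hits = {"Q00001": True, "Q00002": False}
if miniprofile_usable(fp_hits_in_swissprot=0):
    print({acc: trembl_status(found) for acc, found in hits.items()})
    # {'Q00001': 'TP', 'Q00002': 'Unknown'}
```

Note that a match the miniprofile misses is never labelled as a false positive; it simply keeps the Unknown status.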

2 Materials and Methods

• Programming language: Perl version 5.8.6.

• Operating system: Fedora 5 Linux.

An exhaustive list of all the patterns was extracted from PROSITE (release 20.2 of December 12, 2006). This list contains 1328 pattern entries, all designated by the prefix "PS" followed by five digits, forming a unique identifier for each pattern.

A miniprofile was constructed for each of these patterns when possible. It was impossible to create a miniprofile for 19 of them, for two different reasons:

1. The pattern had been deleted from the database between the release and the construction of the miniprofiles. This was the case for seven patterns (e.g. PS00334).

2. The pattern had no maintained match list because it was too unspecific: it was flagged as having a "high probability of occurrence" in PROSITE. This was the case for 12 patterns (for example PS00294, a prenyl group binding site (CAAX box)).


This left us with 1309 patterns for which associated profiles were built.

A program was designed to automatically generate miniprofiles and, to some extent, judge their quality. This program, create_profile.pl, together with its subordinate module, profile_functions.pm, is available on the PROSITE page of Confluence.

2.1 Automatic Miniprofile Construction-Default Method

This section details the default procedure that was used to construct miniprofiles. It involves several steps, as shown in Figure 1. Other methods are detailed in section 2.3. Pattern PS00053, which is a ribosomal protein S8 signature, is used as an example throughout this section.

2.1.1 Extraction

The very first step in profile creation was the extraction of sequences from Incmatch, a relational database of hits of profiles and patterns on Swiss-Prot and TrEMBL. The different kinds of sequences were extracted by performing Structured Query Language (SQL) queries on the database. All extraction queries were restricted to Swiss-Prot only, excluding TrEMBL, and fragments were not taken into account. These queries were aimed at extracting:


Figure 1: Steps of the default method for the creation of miniprofiles. In step 2 the 3 'trim' substeps are performed on the short sequences; then, in step 4, the MSA is performed only on the long sequences whose corresponding short sequences were not removed in step 2.

• True Positive (TP) Sequences: these were the full sequences that were matched at least once by the pattern and that were not annotated as FALSE_POS or UNKNOWN in the column 'note' of the table 'dbentry_2_database' of the 'sptr' database.

• False Positive (FP) Sequences: these were the full sequences that were matched at least once by the pattern and that were annotated as FALSE_POS in the column 'note' of the table 'dbentry_2_database' of 'sptr'. Out of 1309 patterns, 391 (30%) had FP matches.

• False Negative (FN) Sequences: these were the full sequences that were not matched by the pattern but should have been because they belong to the same family or subfamily as the TP. As in the two previous SQL queries, this one involved a constraint on sptr.dbentry_2_database.note, here set to FALSE_NEG. Out of the 1309 patterns, 862 (66%) were found to have FN.

346 (26%) patterns had both FN and FP matches, 907 (69%) had FN and/or FP matches, whereas 402/1309 (31%) had neither of them.

• Short Sequences: these represented the exact regions matched by the pattern for each true positive match.

• Long Sequences: these consisted of the fifteen amino acids on the amino-terminal side of the short sequence, followed by the short sequence itself and the fifteen amino acids on its carboxy-terminal side.
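The relation between short and long sequences can be illustrated with a short Python sketch. The function name and the coordinate convention (0-based, end-exclusive) are assumptions of this example, as is the choice to clip the flanks when the match lies less than fifteen residues from either end of the protein; the text does not say how such cases were handled.

```python
FLANK = 15  # residues added on each side of the pattern match

def long_sequence(full_seq: str, start: int, end: int, flank: int = FLANK) -> str:
    """Return the match full_seq[start:end] plus up to `flank` residues on each side.

    `start` and `end` are 0-based, end-exclusive (an assumption of this sketch).
    Flanks are clipped at the sequence termini (also an assumption).
    """
    return full_seq[max(0, start - flank):min(len(full_seq), end + flank)]

seq = "A" * 20 + "LYLGNAGTAMRPLTA" + "C" * 5   # toy protein, match at positions 20..35
print(long_sequence(seq, 20, 35))
```

In the toy protein only 5 residues follow the match, so the resulting long sequence has 15 + 15 + 5 = 35 residues instead of 45.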

Once the different sequences were extracted, TP, FP and FN sequences were reformatted into FASTA format using the seq_reformat function of T-Coffee¹. Short sequences were realigned with psa2msa².

PS00053:
    TP sequences: 357
    FP sequences: 1
    FN sequences: 20

2.1.2 Seed Alignment

The seed alignment was obtained in four successive steps:

1. Trim 95: In order to reduce redundancy in the alignment, short sequences that were at least 95% identical were removed by using the 'trim' option of T-Coffee.

2. Trim 20: This step consisted of removing short sequences that were less than 20% identical to the others in order to get rid of outliers. The purpose of this step was to prevent a sequence that had been mistakenly annotated as TP, and was in fact a FP, from becoming part of the seed alignment. This too was achieved by using the 'trim' option of T-Coffee³.

3. Trim 90: This step was nearly the same as the first one, except that this time short sequences that were at least 90% similar to each other were excluded and only the 100 most significant sequences were kept in order to bring the input of multiple alignment tools to a manageable size.

¹ t-coffee version 4.18.
² psa2msa 2.3 revision 2.
³ t-coffee, exceptionally version 4.45.


4. Removal of Long Sequences: long sequences whose corresponding short sequences had been eliminated in the three preceding steps were removed here.

PS00053, number of short sequences (out of 357) left after successively applying:
    trim95: 208
    trim20: 208
    trim90: 100
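The effect of the redundancy and outlier filters can be illustrated with a simplified Python sketch. It is not the algorithm implemented by the 'trim' option of T-Coffee: it assumes pre-aligned sequences of equal length, uses a greedy pass for redundancy removal, and interprets "less than 20% identical to the others" as "less than 20% identical to every other sequence". All function names and the toy sequences are hypothetical.

```python
def identity(a: str, b: str) -> float:
    """Percent identity between two pre-aligned, equal-length sequences."""
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)

def drop_redundant(seqs: list, max_identity: float) -> list:
    """Greedy filter: keep a sequence only if it is less than `max_identity`
    percent identical to every sequence already kept."""
    kept = []
    for s in seqs:
        if all(identity(s, k) < max_identity for k in kept):
            kept.append(s)
    return kept

def drop_outliers(seqs: list, min_identity: float) -> list:
    """Remove sequences that are less than `min_identity` percent identical
    to every other sequence in the set."""
    return [s for i, s in enumerate(seqs)
            if any(identity(s, t) >= min_identity for j, t in enumerate(seqs) if i != j)]

toy = ["LYLGNAGTAM", "LYLGNAGTAM", "LFLGNAGTAL", "WWWWWWWWWW"]
step1 = drop_redundant(toy, 95)     # removes the exact duplicate
step2 = drop_outliers(step1, 20)    # removes the unrelated sequence
print(step1)
print(step2)
```

On the toy input the first filter keeps three sequences and the second keeps two, mirroring on a small scale the successive reductions reported for PS00053.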

2.1.3 Multiple Sequence Alignment (MSA)

In the default method, the MSA was performed using T-Coffee.

2.1.4 Weighting the Alignment

Once the MSA had been performed, the sequences were weighted using pfw⁴.

2.1.5 Making the Matrix

pfmake⁵ was used to construct the miniprofiles. The options chosen were '-2b' and '-H2.0': '2' stands for semi-global alignment mode, 'b' for block profile mode, and '2.0' was the score assigned to 'H', the high-cost initiation/termination value that forces the profile to produce full-length matches. The comparison matrix used was BLOSUM45 [19].

2.1.6 Calibrating the Matrix

Each matrix was calibrated with pfcalibrate.cluster using the '-v' (verbose) and the '-M' (profile mode number to scale) options. The database used for calibration was 'window20', since 'reversed' causes problems with palindromic patterns.

2.1.7 Changing the Cut-off at Level 0

Once the profile had been calibrated, the default value of the cut-off at level 0, which is 8.5, was modified according to the outcome of the following procedure:

• pfsearch⁶,⁷ was performed on all TP sequences with option '-a', which forces the alignment to be reported for all sequences, even the ones that are below the cut-off.

⁴ pfw 2.3 revision 2.
⁵ pfmake 2.3 revision 2.
⁶ pfsearch 2.3 revision 2.
⁷ tcoffee (version 4.45), psa2msa, clustalw, pfw, pfmake, pfcalibrate and pfsearch are all implemented in a package called psmaker.


• The lowest score was used to attribute the new cut-off:

\[ \mathrm{Cut\text{-}off} = \mathrm{Lowest\ Score} - 0.010 \tag{5} \]

PS00053:
    lowest score = 15.451
    Cut-off = 15.451 - 0.010 = 15.441

The lower limit for the cut-off was set to 8.000 whatever the lowest pfsearch score of the TP sequences was; i.e. if the result of the previous equation was below 8.000, the resulting cut-off was automatically set to 8.000.

Because the cut-off was changed at level 0, the raw score at level 0 had to be adjusted according to the following equation:

\[ \mathrm{Raw\ Score} = \mathrm{int}\left[\frac{\mathrm{Cut\text{-}off} - R_1}{R_2}\right] + 1 \tag{6} \]

The old values for the cut-off and score were replaced by the new ones in the corresponding miniprofile.

PS00053:
    R1 = 5.2529759
    R2 = 0.0151300
    N_SCORE = 15.441
    SCORE = int[(15.441 - 5.2529759)/0.0151300] + 1
    SCORE = 674 (previously 215)

N.B. The cut-off value at level -1 was never modified; it remained 6.5.
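The cut-off adjustment of this section, i.e. equation (5) with its lower limit followed by equation (6), can be reproduced with the short Python sketch below. The function names are hypothetical, and rounding the cut-off to three decimals is an assumption made here to avoid floating-point noise. With the PS00053 values quoted above, the sketch returns the same cut-off and raw scores as in the text.

```python
CUTOFF_FLOOR = 8.000  # lowest cut-off allowed by the default procedure

def level0_cutoff(lowest_tp_score: float, floor: float = CUTOFF_FLOOR) -> float:
    """Equation (5), followed by the lower limit described in this section."""
    return max(round(lowest_tp_score - 0.010, 3), floor)

def raw_score(cutoff: float, r1: float, r2: float) -> int:
    """Equation (6): raw score corresponding to a normalized cut-off."""
    return int((cutoff - r1) / r2) + 1

# PS00053 values quoted in this section.
cutoff = level0_cutoff(15.451)
print(cutoff)                                   # 15.441
print(raw_score(cutoff, 5.2529759, 0.0151300))  # 674
print(raw_score(8.5, 5.2529759, 0.0151300))     # 215 (the previous default)
```

A lowest TP score of, say, 7.5 would give `level0_cutoff(7.5) == 8.0`, since the computed value 7.490 falls below the lower limit.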

2.2 Assessing the Quality of Miniprofiles

Each automatically generated profile was assessed as all-TP, not-all-TP or problematic. This evaluation was carried out in the following way:

• pfsearch of TP sequences

• pfsearch of FP sequences

• all-TP profile: matches all the TP sequences of its corresponding pattern. Does not match any FP sequence of its corresponding pattern.

• not-all-TP profile: fails to match some of the TP sequences of its corresponding pattern. Does not match any FP sequence of its corresponding pattern.

• Problematic profile: matches one or more of the FP sequences of its corresponding pattern.

PS00053:
    pfsearch of TP sequences: 357
    pfsearch of FP sequences: 0


As the miniprofile for PS00053 produced by the default procedure was able to match all the TP sequences and no FP sequences, it was classified as an all-TP miniprofile.

pfsearch of FN sequences was carried out as well, although it was not used for profile classification.
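The three-way classification of this section reduces to two comparisons, as the following Python sketch shows. The function name and argument names are hypothetical; the inputs are the number of TP sequences of the pattern and the numbers of TP and FP sequences matched by the miniprofile.

```python
def classify_miniprofile(n_tp: int, tp_matched: int, fp_matched: int) -> str:
    """Classify a miniprofile from its pfsearch results on TP and FP sequences."""
    if fp_matched > 0:            # any FP match makes the profile problematic
        return "problematic"
    if tp_matched == n_tp:        # every TP sequence is matched
        return "all-TP"
    return "not-all-TP"

print(classify_miniprofile(357, 357, 0))  # all-TP (the PS00053 example)
```

Matching an FP sequence takes precedence: a profile that matches every TP sequence but also one FP sequence is still problematic.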

2.3 Automatic Profile Construction-Other Methods

When the default method failed to produce all-TP profiles, other methods were tried out consecutively. These new methods involved changing some programs and/or some program parameters. When a new method was tried out, the patterns for which all-TP profiles had been created by the preceding methods were no longer considered.

• Several modifications were attempted in the 'seed alignment' step. The default method called for 3 'trim' substeps, the last one limiting the number of sequences to 100 (cf. section 2.1.2, 'Seed Alignment'). Here 4 other kinds of 'trim' were tried out; in each of them, 1 'trim' step replaced the 3 'trim' substeps of the default method.

– trim99 200: short sequences that were more than 99% similar were excluded and only the 200 most significant ones were kept.

– trim90 100: short sequences that were more than 90% similar were excluded and only the 100 most significant ones were kept.

– trim90 (no limit): short sequences that were more than 90% similar were excluded. All the others were kept.

– trim99 (no limit): short sequences that were more than 99% similar were excluded. All the others were kept.

• Two other MSA tools (other than T-Coffee [17]) were tried out: ProbCons⁸ [18] and ClustalW⁹ [16].

• A profile-making tool other than pfmake was tried out: apsimake, a program developed by Dr. L. Cerutti at the Swiss Institute of Bioinformatics in Lausanne.

• Profile-making parameters: different scores were tried out for the 'H' value, the high-cost initiation/termination score.

• A comparison matrix other than BLOSUM45 was tried out: BLOSUM65.

• Only the randomised database window20 was tried out in the 'calibration' step.

⁸ probcons version 1.1.
⁹ clustalw version 1.81.


Methods 1 and 8 were performed on the 1309 patterns in order to compare the performance of T-Coffee (default method), ProbCons and ClustalW for miniprofile creation [20].

The three following tables show all the details of the methods that were attempted:

Table 1: Default method and methods 1-3.

              Default Method    Method 1          Method 2          Method 3
Trim          95, 20, 90 100    95, 20, 90 100    95, 20, 90 100    95, 20, 90 100
MSA tool      tcoffee           probcons          tcoffee           tcoffee
pf making     pfmake            pfmake            apsimake          apsimake
parameters    -H2.0             -H2.0             -H0.5             -H3.0
matrix        BLOSUM45          BLOSUM45          BLOSUM45          BLOSUM45

Table 2: Methods 4-7.

              Method 4          Method 5    Method 6    Method 7
Trim          95, 20, 90 100    99 200      99 200      99 200
MSA tool      tcoffee           probcons    probcons    probcons
pf making     apsimake          apsimake    apsimake    apsimake
parameters    -H8.0             -H0.5       -H2.0       -H0.5
matrix        BLOSUM45          BLOSUM45    BLOSUM45    BLOSUM65

Table 3: Methods 8-12.

              Method 8          Method 9          Method 10        Method 11        Method 12
Trim          95, 20, 90 100    95, 20, 90 100    90 (no limit)    90 (no limit)    99 (no limit)
MSA tool      clustalw          clustalw          clustalw         probcons         probcons
pf making     pfmake            pfmake            pfmake           pfmake           apsimake
parameters    -H2.0             -H0.5             -H0.5            -H0.5            -H0.5
matrix        BLOSUM45          BLOSUM45          BLOSUM45         BLOSUM45         BLOSUM45

2.4 Automatic Profile Construction-All Methods

For three patterns, the realignment of short sequences with psa2msa systematically failed. Since it was not possible to realign these sequences, this step was skipped for these 3 patterns, regardless of the method considered.

For forty other patterns, all the long sequences were kept in the seed alignment: it was decided that for them the 3 'trim' substeps would be skipped altogether.

When performing any of the methods, the lower limit for the cut-off was first set to 8.000; then, if the generated profile was not-all-TP or problematic, this lower limit was set to 6.500 under certain conditions:

• The lowest score when performing pfsearch on all TP sequences had to be at least 6.510.

• The lowest score when performing pfsearch on all TP sequences had to be higher than the highest score obtained when performing pfsearch on the FP sequences.

In other words, the limit for the cut-off was lowered only if this could "transform" a not-all-TP or problematic profile into an all-TP one.
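The two conditions for lowering the limit can be written as a single predicate. In the Python sketch below the function name is hypothetical, the example scores are invented, and representing "the pattern has no FP sequences" by `None` is an assumption of this example.

```python
def lowered_floor_allowed(lowest_tp_score: float, highest_fp_score=None) -> bool:
    """True if the 6.500 limit may replace the 8.000 one.

    `highest_fp_score` is None when the pattern has no FP sequences
    (an assumption of this sketch).
    """
    if lowest_tp_score < 6.510:
        return False
    return highest_fp_score is None or lowest_tp_score > highest_fp_score

print(lowered_floor_allowed(7.2, 6.9))   # True
print(lowered_floor_allowed(7.2, 7.4))   # False: an FP would score above a TP
print(lowered_floor_allowed(6.3))        # False: below 6.510
```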

2.5 Editing of the Profile

For each miniprofile, 4 lines were changed and 3 lines were added with the help of the information available in prosite.dat for the corresponding pattern as well as what was found when performing pfsearch on TP and FP sequences.

The prefix 'PS' was replaced by 'MP', which stands for 'miniprofile'.

Changed Lines:

1. ID line, changed from:
    ID SEQUENCE_PROFILE; MATRIX.
into:
    ID RIBOSOMAL_S8_MP; MATRIX.

2. AC line, changed from:
    AC ZZ99999;
into:
    AC MP00053;

3. DT line, changed from:
    DT Tue Mar 13 13:52:36 CET 2007
into:
    DT MAR-2007 (CREATED); MAR-2007 (DATA UPDATE); MAR-2007 (INFO UPDATE).

4. DE line, changed from:
    DE AMSA profile.
into:
    DE Ribosomal protein S8 signature.


Added Lines:

1. CC line

CC /WARNING FP FOUND BY THE PROFILE:/

2. CC line

CC /WARNING TP NOT FOUND BY THE PROFILE:/

3. DO line

DO PDOC00052;

In this specific case, as the produced miniprofile was of good quality, nothing had to be added to either of the two warning lines. Here is an example of a warning line corresponding to a profile that failed to match two TP sequences:

CC /WARNING TP NOT FOUND BY THE PROFILE: LTBP2_MOUSE O08999 1551-1562, LTBP2_RAT O35806 1505-1516, /.

It contains the identifier, the accession number and the positions for both missed matches.

For each profile built, a minireport was created. These reports contain all the information on how the miniprofiles were obtained, as well as information that is not mentioned in the profiles themselves, such as the number of FN sequences that were successfully matched by the profiles in question. The miniprofile and minireport for MP00053 can be found as examples in the Appendix, Figures 5 and 6 respectively.
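Two of the edits described in this section, the change of accession prefix and the construction of the "TP not found" warning line, are simple string operations. The Python sketch below illustrates them; it is not the code of the program used in this work, the function names are hypothetical, and the number of spaces after the 'CC' line code is an assumption.

```python
def miniprofile_accession(pattern_ac: str) -> str:
    """Replace the 'PS' prefix of a pattern accession with 'MP'."""
    if not pattern_ac.startswith("PS"):
        raise ValueError("expected a PROSITE accession starting with 'PS'")
    return "MP" + pattern_ac[2:]

def tp_warning_line(missed: list) -> str:
    """Build the warning line from (identifier, accession, start, end) tuples.

    The spacing after 'CC' is an assumption of this sketch.
    """
    if not missed:
        return "CC   /WARNING TP NOT FOUND BY THE PROFILE:/"
    entries = "".join(f"{ident} {acc} {start}-{end}, " for ident, acc, start, end in missed)
    return f"CC   /WARNING TP NOT FOUND BY THE PROFILE: {entries}/."

print(miniprofile_accession("PS00053"))
print(tp_warning_line([("LTBP2_MOUSE", "O08999", 1551, 1562),
                       ("LTBP2_RAT", "O35806", 1505, 1516)]))
```

The second call rebuilds the example warning line shown above from its two missed matches.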

2.6 Queries on Incmatch

Once the miniprofiles were included in Incmatch, numerous queries were performed on the database to extract the following information for each pattern and corresponding miniprofile:

On Swiss-Prot:

• number of TP matches for each pattern.

• number of TP matches for each pattern that are matched by the corre-sponding profile too.

• number of FN for each pattern.

• number of FN for each pattern that are matched by its correspondingprofile.

• number of FP matches for each pattern.

• number of FP matches for each pattern that are matched by the corresponding profile too.


• total number of matches (TP + FP) for each pattern.

• number of matches at level 0 for each profile.

• number of matches at level -1 for each profile.

On TrEMBL:

• number of matches for each pattern.

• number of matches at level 0 for each profile.

• number of matches at level -1 for each profile.

3 Results

3.1 Miniprofiles inserted into Incmatch

Miniprofiles derived from 1271/1309 (97%) patterns were inserted into Incmatch, among which:

• 1237/1271 (97%) all-TP miniprofiles: matching all the TP sequences of the corresponding pattern.

• 34/1271 (3%) not-all-TP miniprofiles: failing to match some of the TP sequences of the corresponding pattern.

• No Problematic miniprofiles: matching FP sequences of the corresponding pattern.

The vast majority of all-TP miniprofiles (1170/1237) were produced by performing the default method. Then 67 other all-TP miniprofiles were obtained by trying out other methods. After this, 72/1309 (6%) patterns had no all-TP miniprofiles associated with them yet (Table 4).

3.1.1 all-TP Miniprofiles inserted into Incmatch

Figure 2 details the MSA tools as well as the profile making tools used to create the 1237 all-TP miniprofiles included in Incmatch. T-Coffee and pfmake led to the creation of most of the all-TP miniprofiles. This does not mean they are respectively the best MSA tool and profile making tool; it simply means that they were part of the default method, which was attempted first.


Table 4: Details about the methods by which the 1271 profiles inserted into Incmatch were obtained.

Method    all-TP   not-all-TP   total

Default     1170            1    1171
1             23            1      24
2             25           30      55
3              3            1       4
4              4            5       9
5              8            4      12
6              0            8       8
7              0            0       0
8              0            0       0
9              2            4       6
10             2            3       5
11             0            1       1
12             0            1       1

total       1237           34    1271

Methods 1-12 were attempted only for patterns for which no associated all-TP miniprofiles were produced by performing the default method.

Figure 2: Precision concerning the MSA tools as well as the matrix making tools used to construct the 1237 all-TP generalized profiles that were included in Incmatch. Miniprofiles obtained by a. default method, b. methods 2, 3 and 4, c. method 1, d. method 5, e. methods 9 and 10 and f./.

Detailed workflows of the all-TP miniprofiles produced following MSA with T-Coffee (1202/1237), ProbCons (31/1237) and ClustalW (4/1237) can be found in the Appendix, Figures 7, 8 and 9 respectively.


3.1.2 not-all-TP Miniprofiles inserted into Incmatch

After all the methods were applied, there were still 72/1309 patterns that had no associated all-TP miniprofile. At this stage, 32/72 not-all-TP miniprofiles were inserted into Incmatch. As each method had been tried out for these 32 patterns, the method chosen was the one leading to the miniprofile that failed to match the fewest TP. If more than one method produced miniprofiles missing the same number of TP, the method chosen was the one giving rise to the miniprofile with the highest cut-off. Among the 34 not-all-TP miniprofiles inserted into Incmatch (Table 9), the proportion of missed matches was quite low, 6% on average. The miniprofile that missed the most TP, MP00198, derived from PS00198 (4Fe-4S ferredoxins, iron-sulfur binding region signature), missed 459 out of 1497 matches (31%), which is also the highest percentage of missed TP.
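The selection rule described above (fewest missed TP first, ties broken by the highest cut-off) can be sketched in a few lines of Python. This is only an illustration: the `Candidate` class, the `choose_candidate` function and the example values are hypothetical and are not taken from the scripts used in this work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    method: str      # hypothetical label of the construction method
    missed_tp: int   # number of pattern TP the miniprofile fails to match
    cutoff: float    # normalized level-0 cut-off of the miniprofile


def choose_candidate(candidates):
    """Pick the miniprofile missing the fewest TP; ties go to the highest cut-off."""
    if not candidates:
        raise ValueError("no candidate miniprofiles")
    # Sorting key: missed_tp ascending, then cut-off descending.
    return min(candidates, key=lambda c: (c.missed_tp, -c.cutoff))


# Made-up candidates for a single pattern.
candidates = [
    Candidate("method 2", missed_tp=3, cutoff=8.0),
    Candidate("method 5", missed_tp=1, cutoff=8.0),
    Candidate("method 9", missed_tp=1, cutoff=9.5),
]
best = choose_candidate(candidates)
print(best.method)  # method 9
```

Methods 5 and 9 both miss a single TP in this made-up example, so the higher cut-off decides in favour of method 9.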

3.1.3 Patterns having at least one profile in the same PROSITE documentation

Another set of 26 patterns out of the 72 happened to be part of the same PROSITE documentation as one or several profiles. An investigation was carried out to see if these patterns might already have a profile in PROSITE that could be used to evaluate pattern matches in TrEMBL. To do this, all the TP matches of these patterns were searched with the profile(s) present in the same documentation. The results of this investigation are reported in Table 10 for the 22/26 patterns having only one profile in the same documentation, and in Table 11 for the 4/26 patterns having 2 or 3 profiles in the same documentation. Out of these 26 patterns, 18 have one profile in the same documentation matching at least 85% of the pattern TP sequences.

Documentations containing one profile and more than one pattern are tricky because the profile often detects a whole family, while the patterns are aimed at detecting active sites of the same family. For instance, PS50240 detects the whole trypsin family of serine proteases while PS00134 and PS00135 detect one subfamily each, respectively the histidine active site and the serine active site [21]. At the same time, since profiles will be used to assess pattern matches only after these matches have been detected by the patterns, profiles matching whole families are perfectly suitable to confirm patterns matching subregions: the pattern has to be present in the protein in the first place.

Given the percentage of pattern TP matched by the profile, there were thus 18 patterns among the 26 for which an existing PROSITE profile could be used to assess their matches in TrEMBL. A pfsearch was performed on the FP matches of these 18 patterns with the profile present in the same documentation, in order to see whether the profile detected FP matches of the pattern. It was found that for 4 patterns, the corresponding profiles detected pattern FP matches. This brings the number of patterns for which a profile already in PROSITE could be used to assess matches in TrEMBL to 14.
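The two-step decision applied here (a TP coverage of at least 85%, then no pattern FP detected) can be written as a small Python check. It is a sketch only: the function names are hypothetical, the TP counts are those of Table 10, and the FP counts passed in the example are invented for illustration, since the per-pattern FP results are not listed in the text.

```python
def coverage(tp_matched, tp_total):
    """Fraction of the pattern TP matches that the profile also detects."""
    return tp_matched / tp_total


def profile_usable(tp_matched, tp_total, fp_matched, threshold=0.85):
    """True if the profile covers enough pattern TP and detects no pattern FP."""
    return coverage(tp_matched, tp_total) >= threshold and fp_matched == 0


# TP counts from Table 10; the fp_matched values are hypothetical.
print(profile_usable(58, 69, fp_matched=0))    # PS00661: 84% coverage -> False
print(profile_usable(206, 221, fp_matched=0))  # PS00197: 93% coverage -> True
print(profile_usable(206, 221, fp_matched=3))  # same coverage, FP found -> False
```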

3.1.4 not-all-TP and Problematic Miniprofiles that were not inserted into Incmatch

The remaining 12 miniprofiles were not entered into Incmatch for several reasons, as can be seen in Table 5.

Table 5: 12 miniprofiles that were not inserted into Incmatch.

AC       ID                   missed TP  pattern TP  % missed TP  matched FP  pattern FP  % matched FP

PS00014  ER_TARGET                  255         258       98.84%           0         114         0.00%

PS00103  PUR_PYR_PR_TRANSFER         26         393        6.62%         105         150        70.00%
PS00103  PUR_PYR_PR_TRANSFER          0         393        0.00%         143         150        95.33%
PS00120  LIPASE_SER                 167         169       98.82%           0          47         0.00%
PS00120  LIPASE_SER                   4         169        2.37%          23          47        48.94%
PS00142  ZINC_PROTEASE              208         593       35.08%           0         140         0.00%
PS00142  ZINC_PROTEASE              117         593       19.73%           1         140         0.71%
PS00216  SUGAR_TRANSPORT_1          204         232       87.93%           0          67         0.00%
PS00216  SUGAR_TRANSPORT_1            0         232        0.00%          19          67        28.36%
PS60004  OMEGA_CONOTOXIN             21          37       56.76%           0           7         0.00%
PS60004  OMEGA_CONOTOXIN              1          37        2.70%           1           7        14.29%

PS00105  AA_TRANSFER_CLASS_1          0         102        0.00%           1           6        16.67%
PS00217  SUGAR_TRANSPORT_2            0         176        0.00%          12         115        10.43%
PS00475  RIBOSOMAL_L15                0         281        0.00%           1          13         7.69%
PS00903  CYT_DCMP_DEAMINASES          0         125        0.00%           2           3        66.67%
PS01108  RIBOSOMAL_L24                0         303        0.00%           1           7        14.29%
PS01348  MRAY_2                       0         243        0.00%           1           3        33.33%

The table is divided into 3 parts:
-1st part: 1 pattern for which no miniprofile could be built.
-2nd part: 2*5 miniprofiles. 1st row: miniprofile that detects few FP sequences but fails to match many TP. 2nd row: another miniprofile for the same pattern as the 1st row, which detects more FP but misses fewer TP sequences.
-3rd part: 6 miniprofiles that can perhaps be entered into Incmatch by increasing the cut-off.

For the first pattern, PS00014, no miniprofile could be constructed. This pattern is very short and restricted to the carboxy-terminus:

[KRHQSA] - [DENQ] - E - L>

Anchoring profiles to the amino- or carboxy-terminus was not part of the automatic procedure developed in this work, although it has been tried for this specific pattern. The reason why no miniprofile could be created for this specific pattern might be that the 15 amino acids preceding the pattern in the TP sequences do not show enough conservation.

For the next 5 patterns, increasing the specificity of the miniprofiles by increasing the cut-off meant that just a few TP sequences would be detected. The last 6 miniprofiles will perhaps be entered if a solution can be found, i.e. a cut-off high enough that no FP sequences are detected while only a low proportion of TP sequences is missed.

3.1.5 Statistics about the 1309 patterns and the 1271 miniprofiles included in Incmatch

Table 6: Statistics about the 1309 patterns and 1271 profiles entered into Incmatch.

                 1309        1271        1237        34          38          72

FP               391 (30%)   355 (28%)   324 (26%)   31 (90%)    36 (97%)    67 (93%)
FN               862 (66%)   826 (65%)   793 (64%)   33 (97%)    36 (95%)    69 (96%)
FP and FN        346 (26%)   312 (25%)   282 (23%)   30 (88%)    34 (89%)    64 (89%)
FP or FN         907 (69%)   869 (68%)   835 (68%)   34 (100%)   38 (100%)   72 (100%)
Min. TP          3           3           3           29          37          29
Max. TP          8560        1815        1450        1815        8560        8560
Av. TP           115         92          86          303         870         603
Median TP        48          46          45          149         283         217
S.D. TP          320         140         118         436         1530        1179

Min. cut-off     n.a.        6.50        6.51        6.50        n.a.        n.a.
Max. cut-off     n.a.        50.22       50.22       21.00       n.a.        n.a.
Av. cut-off      n.a.        13.50       13.63       8.98        n.a.        n.a.
Median cut-off   n.a.        12.58       12.74       8.00        n.a.        n.a.
S.D. cut-off     n.a.        5.06        5.04        2.79        n.a.        n.a.
cut-off < 8.5    n.a.        212 (17%)   185 (15%)   27 (79%)    n.a.        n.a.
cut-off < 8.0    n.a.        141 (11%)   140 (11%)   1 (3%)      n.a.        n.a.

• Columns
1309: Total number of patterns.
1271: Total number of miniprofiles inserted into Incmatch.
1237: Number of all-TP miniprofiles inserted into Incmatch.
34: Number of not-all-TP miniprofiles inserted into Incmatch.
38: Number of patterns for which no miniprofiles were inserted into Incmatch.
72 = 34 + 38

• Rows
Median (50th percentile): value that divides the distribution into halves.
S.D. = standard deviation.

• Cells
n.a.: not applicable

As can be observed in the first part of Table 6, the average and the median number of TP are both highest in group '38' and lowest in group '1237'. On the one hand, this suggests that patterns having many TP matches are more difficult to derive a profile from. On the other hand, most of the methods we tried limited the number of sequences in the seed alignment to 100 or 200, which implies that for a pattern matching 1000 sequences, the proportion of TP used in the seed alignment could not be higher than 10% or 20%. This suggests that the upper limit of sequences used in the MSA should be set proportionally to the number of TP.

The fact that a pattern matches FP sequences also appears to correlate with the difficulty of constructing an all-TP profile, since the proportion of patterns matching FP sequences is highest in group '38' and lowest in group '1237'. On the one hand this is logical, because a miniprofile cannot match FP sequences of its pattern if the pattern does not have any FP (Table 6, 1st part). On the other hand, the fact that both the number of TP and the proportion of patterns with FP are high in group '72' suggests that we are faced with lower-complexity patterns, which could explain why it is harder to construct miniprofiles for them.

The second part of Table 6 shows different statistics regarding the cut-off of the miniprofiles that were inserted into Incmatch. Profiles with a cut-off lower than 8.5 are more at risk of matching sequences that would not be detected by the pattern they are derived from.

3.2 Comparing T-Coffee, ProbCons and ClustalW

For the construction of miniprofiles, no significant difference was observed among the three MSA tools. Miniprofile construction following MSA with ClustalW, ProbCons and T-Coffee led to respectively 1180/1309, 1172/1309 and 1170/1309 all-TP miniprofiles (Figure 3).

Figure 3: all-TP miniprofiles produced following MSA by T-Coffee (default method), ProbCons (method 1) and ClustalW (method 8).


3.3 Incmatch

1271 miniprofiles were included in the database: 1237 all-TP and 34 not-all-TP ones. Up to this point miniprofiles have been tested only by performing pfsearch on TP, FP and FN of the patterns they were derived from. In order to assess whether miniprofiles can be used in connexion with patterns to give a status to their matches in TrEMBL, miniprofiles had to be tested on the whole Swiss-Prot database.

3.3.1 Testing Miniprofiles in Swiss-Prot

In order to be sure that miniprofiles matching pattern FP sequences would be excluded, pattern FP sequences were searched with the corresponding miniprofiles. It was found that 2 miniprofiles, MP00010 and MP00201, detected pattern FP matches. These FP sequences are proteins that were not yet in Swiss-Prot at the time the two miniprofiles were included in Incmatch. In addition, 2 other patterns have been deleted from PROSITE: PS00403 and PS00404.

The results of the three following equations were calculated for the remaining 1267 miniprofile/pattern couples:

\[
\frac{\text{Number of pattern TP detected by the profile at level 0}}{\text{Number of pattern TP}} \tag{7}
\]

\[
\frac{\text{Number of pattern (TP + FN) detected by the profile at level 0}}{\text{Number of pattern (TP + FN)}} \tag{8}
\]

\[
\frac{\text{Total number of profile matches at level 0}}{\text{Number of pattern (TP + FN) detected by the profile}} \tag{9}
\]

Equation (7) computes the proportion of pattern TP matches that are detected by the miniprofile at level 0. Out of the 1267 miniprofile/pattern couples, 1218 (96%) displayed a result of 1 and 49 (4%) a result lower than 1. These 49 miniprofiles are either not-all-TP miniprofiles that were included in Incmatch (33) or, for the remaining 16, miniprofiles derived from patterns for which new protein entries were added to Swiss-Prot that the corresponding miniprofiles are not able to detect.

Supposing that the manually assigned status of a pattern match is correct, an ideal profile would give a result of 1 for equation (8), since it would detect all the pattern TP matches as well as all its FN. Equation (8) gave a result of 1 for 655 miniprofiles, which means that 52% of miniprofiles were able to detect all the pattern TP matches and all their FN too. Among these 1267 patterns, 803 had FN, and 597/803 (74%) miniprofiles were capable of matching at least 1 FN.

Equation (9) is the total number of profile matches at level 0 over the number of profile matches that are either pattern TP matches or pattern FN. It gave a result of 1 for 1157 miniprofiles, which means that 91% of the miniprofiles do not detect sequences other than pattern TP matches or pattern FN.

Statistics concerning the results for these three equations are reported in Table 7.

Table 7: Statistics on Patterns and Miniprofiles in Swiss-Prot.

           Equation (7)   Equation (8)   Equation (9)

Min.       0.50           0.13           1.00
Max.       1.00           1.00           7.73
Av.        1.00           0.95           1.03
Mode       1.00           1.00           1.00
Perc. 25   1.00           0.95           1.00
Median     1.00           1.00           1.00
Perc. 75   1.00           1.00           1.00
Perc. 90   1.00           1.00           1.00
S.D.       0.02           0.09           1.00

-Mode: most frequent result in the distribution.
-Perc. 25 = 25th percentile: indicates that 25% of the results are below this value.
-Median (50th percentile): result that divides the distribution into halves.
-Perc. 75 = 75th percentile: indicates that 75% of the results are below this value.
-Perc. 90 = 90th percentile: indicates that 90% of the results are below this value.
-S.D.: standard deviation.

If we consider only miniprofiles matching at least 80% of pattern TP matches and an upper limit of 1.20 as a result for equation (9), 1230 miniprofiles can be used to assess pattern matches in TrEMBL. The other 35 miniprofiles will have to be checked one by one. Figure 4 shows the distribution of results for equation (8); it can be observed that most miniprofiles display a result close to 1.
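As a minimal sketch, equations (7) to (9) and the acceptance criterion (at least 80% of pattern TP matched, equation (9) at most 1.20) could be computed as follows. The `SwissProtCounts` class and the function names are hypothetical, and the example counts are illustrative: the TP and FN figures are borrowed from the PS00053 report (Figure 6), while the total number of level-0 matches is an assumption.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SwissProtCounts:
    pattern_tp: int       # pattern TP matches in Swiss-Prot
    pattern_fn: int       # pattern FN in Swiss-Prot
    tp_found: int         # pattern TP detected by the profile at level 0
    fn_found: int         # pattern FN detected by the profile at level 0
    profile_matches: int  # all profile matches at level 0


def eq7(c):
    return c.tp_found / c.pattern_tp


def eq8(c):
    return (c.tp_found + c.fn_found) / (c.pattern_tp + c.pattern_fn)


def eq9(c):
    detected = c.tp_found + c.fn_found
    # A profile detecting no pattern TP or FN has no meaningful ratio.
    return c.profile_matches / detected if detected else math.inf


def usable_in_trembl(c, min_tp_fraction=0.80, max_eq9=1.20):
    """Acceptance criterion: enough pattern TP matched and few extra matches."""
    return eq7(c) >= min_tp_fraction and eq9(c) <= max_eq9


# 357 TP, 20 FN, 17 FN detected (Figure 6); 374 level-0 matches is assumed.
c = SwissProtCounts(pattern_tp=357, pattern_fn=20, tp_found=357,
                    fn_found=17, profile_matches=374)
print(eq7(c), round(eq8(c), 3), eq9(c), usable_in_trembl(c))
```

With these numbers equation (7) and equation (9) both give 1, and equation (8) gives 374/377, i.e. slightly below 1 because 3 FN remain undetected.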


Figure 4: Frequency of results for equation (8) for the 1267 miniprofiles. 612 display a result < 1.00 and 655 display a result of 1.00.

3.3.2 TrEMBL

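The results of the following equations were calculated in order to compare pattern and miniprofile matches in TrEMBL:

\[
\frac{\text{Number of profile matches at level 0}}{\text{Number of pattern matches}} \tag{10}
\]

\[
\frac{\text{Number of profile matches at level } -1}{\text{Number of pattern matches}} \tag{11}
\]

\[
\frac{\text{Number of profile matches at level 0} + \text{Number of profile matches at level } -1}{\text{Number of pattern matches}} \tag{12}
\]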


These equations could not be calculated for 13 pattern-miniprofile couples, because 13 patterns had absolutely no match in TrEMBL. Among these 13 patterns, most are toxin family signatures and only one corresponding profile had matches at level 0. This miniprofile, MP60027, derived from a contryphan family signature, matched 11 proteins in TrEMBL while PS60027 matched nothing. Statistics on the results obtained for equations (10), (11) and (12) can be seen in Table 8. Altogether, 17 miniprofiles had no matches at level 0; for 12 of these 17, the corresponding patterns had no matches either.
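The TrEMBL ratios, including the case of patterns without any match, can be sketched as below. Two assumptions are made here: equations (11) and (12) are taken to share the denominator of equation (10), which is what the averages in Table 8 suggest (1.145 + 8.11 ≈ 9.25), and the function name `trembl_ratios` as well as the example counts are hypothetical.

```python
def trembl_ratios(pattern_matches, level0, level_minus1):
    """Return equations (10), (11), (12), or None if the pattern has no match."""
    if pattern_matches == 0:
        # The ratios are undefined when the pattern matches nothing in TrEMBL.
        return None
    eq10 = level0 / pattern_matches
    eq11 = level_minus1 / pattern_matches
    eq12 = (level0 + level_minus1) / pattern_matches
    return eq10, eq11, eq12


# A pattern without matches, like PS60027 (11 level-0 profile matches; the
# level -1 count of 0 is made up for the example).
print(trembl_ratios(0, 11, 0))       # None
# Made-up counts for a pattern with 100 matches.
print(trembl_ratios(100, 150, 50))   # (1.5, 0.5, 2.0)
```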


Table 8: Statistics on Patterns and Miniprofiles in TrEMBL.

           Equation (10)   Equation (11)   Equation (12)

Min        0.00            0.00            0.20
Max        14.00           514.00          515.00
Average    1.145           8.11            9.25
Mode       1.00            0.00            2.00
Perc. 25   0.875           0.88            2.01
Median     1.052           2.05            3.21
Perc. 75   1.232           5.738           7.06
Perc. 90   1.598           16.591          18.06
S.D.       0.858           26.210          26.16

4 Discussion

The major drawback of patterns is that they do not produce scores associated with their matches. This is not an issue in Swiss-Prot, since the database is manually curated and pattern matches are thoroughly checked by annotators. In TrEMBL, the anteroom of Swiss-Prot, the entries are only computer-annotated, and it is hard to know what level of confidence can be placed in a pattern match.

To overcome this, we have developed a new tool to evaluate PROSITE pattern matches in TrEMBL based on Swiss-Prot information. This tool consists in the automatic generation of one generalized profile for each pattern, given that profiles, as opposed to patterns, produce a score associated with their matches. The goal is to confirm pattern matches as TP if the produced profile detects them too, or to tag them as Unknown if it does not. Hence this tool should facilitate the work of curators in the process of transferring TrEMBL entries to Swiss-Prot.

The emphasis was put on specificity first: no miniprofile that detected FP matches for a particular pattern in Swiss-Prot was included in Incmatch. Out of the 1271 miniprofiles that were included in Incmatch, 355 were derived from patterns matching FP, and the total number of pattern FP matches was 4562. We succeeded in producing profiles that did not detect any of these FP sequences in Swiss-Prot, at least at the time they were included in Incmatch. Later on in this work, it was discovered that two miniprofiles did in fact match FP sequences; these entries had not been present in Swiss-Prot at the time the miniprofiles in question were created, and the two miniprofiles will not be used to assess pattern matches in TrEMBL.

Sensitivity was important too but not as critical as specificity, because mistakenly confirming a pattern match as a TP is more problematic than not detecting a pattern TP match and tagging it as Unknown.

Out of the 1309 patterns present in PROSITE, 1271 (97%) had an automatically produced miniprofile included in Incmatch, and at the very least 1230/1271 (97%) of these miniprofiles can be used in TrEMBL. 14 other patterns have a profile in PROSITE that can be used to assess their matches in TrEMBL, which brings the number of patterns for which no solution has been found to 26: 12 are presented in Table 5 of the Results section, 12 others are in Tables 10 and 11 of the Appendix, and the last 2 correspond to MP00010 and MP00201, the 2 miniprofiles that have to be removed from Incmatch. Out of these 26 patterns, 25 have FP matches, and their total number of FP matches sums up to 1404.

We have tried several methods to automatically produce profiles able to match all the TP matches and no FP matches of their corresponding patterns. Such miniprofiles were designated as all-TP ones. Applying the default method already led to the production of the vast majority of all-TP miniprofiles. Our strategy was then to change one parameter or one program at a time, and to perform the new method only on the patterns for which no all-TP miniprofile had been created by the preceding methods. The twelve other methods tested enabled us to obtain 67 more all-TP profiles.

It is difficult to outline which parameters and/or programs had more impact on the quality of the produced profiles, since only three methods were tested on the whole set of 1309 patterns. These three methods (default method with T-Coffee [17], method 2 with ProbCons [18] and method 9 with ClustalW [16]) gave similar results, with a slight, non-significant advantage of ClustalW over the two other MSA tools. According to E. Tillier, whose research team tested nine alignment tools [20], ProbCons aligns sequences slightly better than T-Coffee, and ClustalW appears to be the worst of the three tools. However, their results indicate that ClustalW shows good accuracy when applied to short sequences. This is consistent with our results, since in our case MSA were performed on the region matched by the pattern plus 15 amino acids on the N-terminal side and 15 amino acids on the C-terminal side, leading to miniprofiles with an average length of 47 match positions, which is somewhat short.

Among the 1267 miniprofiles that were included in Incmatch and did not detect any FP, 803 were derived from patterns having FN, and only 597/803 (74%) miniprofiles were able to detect some or all of the FN. Moreover, the rate of patterns with FN was very high (96%) for the 72 patterns for which either no miniprofile or a not-all-TP miniprofile was inserted into Incmatch.

On the one hand, this suggests that FN should be taken into account in the automatic procedure in order to increase miniprofile performance; on the other hand, including FN in the MSA may increase the rate of miniprofile FP. Including FN could involve finding the parts of FN sequences that are closest to the pattern and including them in the seed alignment [22]. Alternatively, one could produce a miniprofile, perform a pfsearch of FN sequences, restart the whole procedure while including the detected FN sequences in the seed alignment, and so on in an iterative fashion until no new FN sequence is detected by pfsearch.

If we were to continue this work, the most interesting perspective would be to apply the strategy to bigger regions, maybe even to whole domains containing the regions matched by patterns. In this case, the domain boundaries could be obtained from a protein structure database such as CATH [23] or SCOP [24].

Finally, this approach shows that when the dataset is of good quality, it is possible to automatically construct profiles that are consistent in terms of both specificity and sensitivity.


5 Appendix

ID RIBOSOMAL_S8_MP; MATRIX.
AC MP00053;
DT APR-2007 (CREATED); APR-2007 (DATA UPDATE); APR-2007 (INFO UPDATE).
DE Ribosomal protein S8 signature.
MA /GENERAL_SPEC: ALPHABET='ABCDEFGHIKLMNPQRSTVWYZ'; LENGTH=47;
MA /DISJOINT: DEFINITION=PROTECT; N1=5; N2=43;
MA /NORMALIZATION: MODE=1; FUNCTION=LINEAR; R1=5.2529759; R2=0.0151300; TEXT='-LogE';
MA /CUT_OFF: LEVEL=0; SCORE=674; N_SCORE=15.441; MODE=1; TEXT='!';
MA /CUT_OFF: LEVEL=-1; SCORE=83; N_SCORE=6.5; MODE=1; TEXT='?';
MA /DEFAULT: M0=-9; D=-20; I=-20; B0=-200; B1=-200; E0=-200; E1=-200; MI=-105; MD=-105; IM=-105; DM=-105;
MA /I: B0=0; BI=-105; BD=-105;
MA /M: SY='R'; M=-20,-10,-30,-10,0,-20,-20,0,-30,30,-20,-10,0,-20,10,70,-10,-10,-20,-20,-10,0;
MA /M: SY='V'; M=-5,-11,-19,-10,-5,-13,-22,-18,5,-8,-4,-2,-13,-19,-11,-10,-4,-2,14,-29,-13,-8;
MA /M: SY='Y'; M=-17,-23,-28,-26,-22,28,-31,6,8,-16,8,5,-21,-29,-15,-14,-21,-10,-2,16,54,-22;
MA /M: SY='E'; M=2,-5,-18,-3,11,-22,-15,-13,-14,1,-16,-11,-6,-6,-1,-6,3,-3,-8,-29,-18,5;
MA /M: SY='K'; M=-4,2,-25,1,3,-27,2,-8,-27,13,-27,-14,5,-12,4,7,3,-6,-21,-25,-16,3;
MA /I: I=-6; MD=-32;
MA /M: SY='W'; M=-5,-17,-27,-19,-14,2,-18,-5,-11,-6,-12,-8,-14,-21,-9,-6,-10,-9,-11,22,17,-12; D=-6;
MA /I: I=-6; MI=-32; IM=-32; DM=-32;
MA /M: SY='E'; M=-4,6,-24,6,13,-24,-16,-8,-19,8,-18,-12,2,-11,4,1,0,-1,-14,-28,-14,8;
MA /M: SY='E'; M=-8,14,-26,16,19,-28,-14,-2,-27,14,-24,-16,12,-10,9,9,0,-6,-23,-29,-16,14;
MA /M: SY='L'; M=-12,-19,-25,-23,-14,-5,-28,-14,11,-8,13,10,-14,-23,-7,1,-18,-9,5,-20,-2,-13;
MA /M: SY='P'; M=-12,-18,-32,-15,-7,-7,-23,-9,-12,-7,-11,-8,-17,27,-9,-10,-13,-8,-17,-15,0,-11;
MA /M: SY='R'; M=-11,-15,-25,-15,-5,-10,-22,-9,-8,3,6,2,-13,-18,-1,11,-15,-8,-8,-20,-4,-4;
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
MA /M: SY='K'; M=-4,4,-25,5,14,-25,-16,-6,-24,16,-19,-12,2,-11,7,13,-2,-6,-18,-26,-14,10;
MA /M: SY='K'; M=-1,-5,-24,-7,4,-23,-16,-8,-19,15,-14,-7,-3,-14,9,14,-4,-6,-15,-22,-12,6;
MA /M: SY='G'; M=-7,6,-27,2,-5,-26,21,2,-31,1,-28,-16,16,-18,-3,0,0,-12,-28,-27,-18,-5;
MA /M: SY='V'; M=-4,-29,-18,-33,-28,0,-33,-28,34,-24,18,14,-25,-25,-24,-23,-15,-3,35,-25,-5,-28;
MA /M: SY='G'; M=1,-10,-29,-10,-19,-30,67,-20,-39,-20,-30,-20,0,-20,-19,-20,1,-19,-29,-20,-30,-19;
MA /M: SY='G'; M=0,-10,-30,-10,-20,-30,70,-20,-40,-20,-30,-20,0,-20,-20,-20,0,-20,-30,-20,-30,-20;
MA /M: SY='E'; M=-10,3,-29,9,38,-28,-21,-3,-24,14,-19,-13,-2,-6,18,8,-3,-9,-22,-27,-16,27;
MA /M: SY='V'; M=-3,-28,-17,-31,-25,2,-30,-26,28,-24,22,13,-26,-26,-23,-22,-16,-4,30,-25,-5,-25;
MA /M: SY='L'; M=-10,-30,-22,-33,-23,7,-33,-23,29,-29,40,20,-27,-27,-20,-23,-27,-10,17,-20,0,-23;
MA /M: SY='C'; M=9,-15,34,-23,-21,-13,0,-23,-23,-22,-16,-15,-12,-25,-22,-23,-2,-8,-10,-30,-21,-21;
MA /M: SY='Y'; M=-14,-16,-24,-18,-15,22,-24,2,-6,-9,-3,-4,-13,-25,-12,-6,-10,-3,-8,6,39,-15;
MA /M: SY='V'; M=-5,-30,-16,-33,-29,9,-32,-28,30,-23,13,12,-26,-27,-28,-22,-14,-4,38,-22,-3,-29;
MA /M: SY='W'; M=-14,-27,-30,-29,-23,22,-23,-7,-10,-17,-9,-9,-25,-28,-17,-16,-22,-16,-16,62,41,-20;
MA /I: E0=0; IE=-105; DE=-105;
CC /RESCALED_BY="/home/grp-sprot/www/psmaker/bin/pfscale /home/grp-sprot/hamap/tmp/pfcal4yCQ.score1 /home/grp-sprot/hamap/tmp..";
CC /GENERATED_BY="/home/sun-000/SwissProt/grp-sprot/www/psmaker/bin/pfmake -2b -H2.0 /home/sun-000/Stagiaire/cuche/AGAIN/AG PS..";
CC /WARNING FP FOUND BY THE PROFILE:/
CC /WARNING TP NOT FOUND BY THE PROFILE:/
DO PDOC00052;
//

Figure 5: Miniprofile MP00053. ID: Identification, AC: Accession number, DT: Date, DE: Description, MA: Matrix, CC: Comment, DO: Document. The horizontal rule '::::' indicates that a few MA lines have been deleted to fit the page. A full explanation of the generalised profile syntax is available at http://www.expasy.ch/txt/profile.txt.


Report for PS00053
-----------------------------
-alignment method: t_coffee
-matrix making: pfmake
-psa2msa, trim95, trim20, trim90
-pfmake: -2b, -H2.0, blosum45
-TP.extraction: 357 sequences
-FP.extraction: 1 sequences
-FN.extraction: 20 sequences
-trim95: 208 sequences
-trim20: 208 sequences
-trim90: 100 sequences

TRUE POSITIVES
-TP-pf_search(-a): 357 sequences
-TP-Lowest Score (-a): 15.451
-Cut-Off: 15.441 (15.451-0.01)
-TP-Sequence(s) with Lowest Score (-a):
R15A2_ARATH O82205 start.99-116.stop
RS8_SULAC O05636 start.103-120.stop
-SCORE=int((N_SCORE-R1)/R2)+1
-Old Raw Score: 215
-New Raw Score: 674

FALSE POSITIVES
-FP-pf_search(-a): 1 sequence
-FP-Highest Score(-a): 5.949
-FP-Sequence(s) with Highest Score (-a):
5.949 46 CSCB_ECOLI P30000 PS00053 359 376

-Number of TP not found by the profile: 0
-Number of FP found by the profile: 0
-Number of FN found by the profile: 17

Figure 6: Report for miniprofile MP00053. It indicates all the programs and the parameters that were used to construct the profile. The last line indicates that out of the 20 known FN, i.e. proteins not matched by PS00053, MP00053 was able to detect 17.
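The cut-off and raw-score lines of the report can be checked with a short Python sketch. It applies the two rules shown in the report (cut-off = lowest TP score minus 0.01, and SCORE = int((N_SCORE - R1)/R2) + 1) to the values of MP00053; the function names are hypothetical, and rounding the cut-off to three decimals is an assumption made to avoid floating-point noise.

```python
def level0_cutoff(lowest_tp_score, margin=0.01):
    """Normalized cut-off placed just below the lowest-scoring TP match."""
    return round(lowest_tp_score - margin, 3)


def raw_score(n_score, r1, r2):
    """Raw score corresponding to a normalized score (linear normalization)."""
    return int((n_score - r1) / r2) + 1


# R1 and R2 are taken from the NORMALIZATION line of MP00053 (Figure 5).
R1, R2 = 5.2529759, 0.0151300

cutoff = level0_cutoff(15.451)       # lowest TP score from the report
print(cutoff)                        # 15.441
print(raw_score(cutoff, R1, R2))     # 674
print(raw_score(6.5, R1, R2))        # 83
```

The two raw scores agree with the SCORE values given for levels 0 and -1 in the profile of Figure 5.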


Figure 7: Workflow for the 1202 all-TP profiles obtained following MSA with T-Coffee that were inserted into Incmatch. Miniprofiles obtained by a. default method, b. default method, c. method 2, d. method 3, e. method 4 and f. method 2.

Figure 8: Workflow for the 31 all-TP profiles obtained following MSA with ProbCons that were inserted into Incmatch. Miniprofiles obtained by a. method 1 and b. method 5.

Figure 9: Workflow for the 4 all-TP profiles obtained following MSA with ClustalW that were inserted into Incmatch. Miniprofiles obtained by a. method 9 and b. method 10.


Table 9: Details concerning the 34 not-all-TP profiles inserted into Incmatch.

AC       ID                        pattern TP  missed TP  % missed TP  TP lowest score  FP highest score  cut-off

MP01036  HSP70_3_MP                       537          1        0.19%             5.63              7.08     8.00
MP00527  RIBOSOMAL_S14_MP                 316          1        0.32%             4.22              4.75     8.00
MP01347  MRAY_1_MP                        206          1        0.49%             6.05                 /     8.00
MP00059  ADH_ZINC_MP                      257          2        0.78%             4.29              6.13     8.00
MP00070  ALDEHYDE_DEHYDR_CYS_MP           253          2        0.79%             5.04              6.32     8.00
MP00690  DEAH_ATP_HELICASE_MP             102          1        0.98%             6.37              7.26     8.00
MP00189  LIPOYL_MP                        182          2        1.10%             3.5               4.62     8.00
MP00062  ALDOKETO_REDUCTASE_2_MP           77          1        1.30%            15.43             16.84    17.00
MP00086  CYTOCHROME_P450_MP               715         10        1.40%             3.35              7.35     8.00
MP00228  TUBULIN_B_AUTOREG_MP             206          3        1.46%             9.7               9.91    10.00
MP00262  INSULIN_MP                       199          3        1.51%             5.53              4.95     8.00
MP00118  PA2_HIS_MP                       279          5        1.79%             4.95              5.28     8.00
MP00455  AMP_BINDING_MP                   327          6        1.83%             2.87              5.87     8.00
MP00136  SUBTILASE_ASP_MP                  95          2        2.11%             6.42              6.98     8.00
MP00551  MOLYBDOPTERIN_PROK_1_MP           73          2        2.74%             6.53              7.30     8.00
MP00659  GLYCOSYL_HYDROL_F5_MP             73          2        2.74%             6.24              6.18     8.00
MP00290  IG_MHC_MP                        391         12        3.07%             5.33              7.50     8.00
MP00213  LIPOCALIN_MP                      92          3        3.26%             4.39              7.35     8.00
MP00713  NA_DICARBOXYL_SYMP_1_MP           61          2        3.28%             6.06                 /     8.00
MP00572  GLYCOSYL_HYDROL_F1_1_MP           69          3        4.35%             4.93              5.98     8.00
MP00445  FGGY_KINASES_2_MP                134          6        4.48%             8.68             11.82    12.00
MP00064  L_LDH                            216         10        4.63%            12.61             20.74    21.00
MP00116  DNA_POLYMERASE_B_MP              116          6        5.17%             6.02              6.24     8.00
MP00267  TACHYKININ                        56          3        5.36%             6.29              7.92     8.00
MP00178  AA_TRNA_LIGASE_I_MP             1815        101        5.56%             5.1               5.73     8.00
MP00306  CASEIN_ALPHA_BETA_MP              30          2        6.67%             4.78              7.69     8.00
MP60014  ALPHA_CONOTOXIN_MP                29          2        6.90%             6.22                 /     6.50
MP00879  ODR_DC_2_2                        92         10       10.87%             6.5               6.65     8.00
MP00165  DEHYDRATASE_SER_THR_MP            91         14       15.38%             8.61             10.78    11.00
MP00092  N6_MTASE                         163         27       16.56%             4.24              6.83     8.00
MP00430  TONB_DEPENDENT_REC_1_MP           38          7       18.42%             4.83              7.58     8.00
MP00539  PYROKININ_MP                      53         13       24.53%             4.45              3.92     8.00
MP00010  ASX_HYDROXYL_MP                 1469        429       29.20%             6.58             10.01    10.10
MP00198  4FE4S_FERREDOXIN_MP             1497        459       30.66%             7.19              9.72     9.82

Average                                   303         33           6%             6.14              7.24     8.98

-pattern TP: Number of TP for the corresponding pattern.
-missed TP: Number of TP that the miniprofile failed to match.
-TP lowest score: lowest score obtained when performing pfsearch of TP sequences.
-FP highest score: highest score obtained when performing pfsearch of FP sequences.


Table 10: 22/72 patterns having one profile in the same PROSITE documentation.

pattern AC  pattern ID            TP pattern  profile AC  profile ID            TP pro/TP pat

PS01186     EGF_2                       2596  PS50026     EGF_3                 1769/2596 (69%)
PS00022     EGF_1                       2291  PS50026     EGF_3                 1636/2291 (71%)
PS00134     TRYPSIN_HIS                   43  PS50240     TRYPSIN_DOM           432/432 (100%)
PS00135     TRYPSIN_SER                  432  PS50240     TRYPSIN_DOM           432/432 (100%)
PS00107     PROTEIN_KINASE_ATP          1999  PS50011     PROTEIN_KINASE_DOM    1999/1999 (100%)
PS00108     PROTEIN_KINASE_ST           1762  PS50011     PROTEIN_KINASE_DOM    1762/1762 (100%)
PS00109     PROTEIN_KINASE_TYR           463  PS50011     PROTEIN_KINASE_DOM    463/463 (100%)
PS00141     ASP_PROTEASE                 393  PS50175     ASP_PROT_RETROV       121/393 (31%)
PS00652     TNFR_NGFR_1                  127  PS50050     TNFR_NGFR_2           64/127 (50%)
PS01359     ZF_PHD_1                     294  PS50016     ZF_PHD_2              186/294 (63%)
PS01360     ZF_MYND_1                     46  PS50865     ZF_MYND_2             36/46 (78%)
PS00661     FERM_2                        69  PS50057     FERM_3                58/69 (84%)
PS00197     2FE2S_FER_1                  221  PS51085     2FE2S_FER_2           206/221 (93%)
PS00041     HTH_ARAC_FAMILY_1            128  PS01124     HTH_ARAC_FAMILY_2     122/128 (95%)
PS00012     PHOSPHOPANTETHEINE           285  PS50075     ACP_DOMAIN            282/285 (99%)
PS00036     BZIP_BASIC                   155  PS50217     BZIP                  155/155 (100%)
PS01159     WW_DOMAIN_1                  181  PS50020     WW_DOMAIN_2           181/181 (100%)
PS00518     ZF_RING_1                    528  PS50089     ZF_RING_2             528/528 (100%)
PS00237     G_PROTEIN_RECEP_F1_1        1730  PS50262     G_PROTEIN_RECEP_F1_2  1730/1730 (100%)
PS00018     EF_HAND_1                   2041  PS50222     EF_HAND_2             2041/2041 (100%)
PS00211     ABC_TRANSPORTER_1           3308  PS50893     ABC_TRANSPORTER_2     3308/3308 (100%)
PS00028     ZINC_FINGER_C2H2_1          8560  PS50157     ZINC_FINGER_C2H2_2    8560/8560 (100%)

-TP pro/TP pat: proportion of pattern TP that are matched by the profile.
-In bold: profiles that could be associated with the pattern in their respective row.
-In italics: profiles that cannot be associated with the pattern in their respective row because they detect FP matches.
-Profile PS50026 appears in the same documentation as two patterns: PS01186 and PS00022.
-Profile PS50240 appears in the same documentation as two patterns: PS00134 and PS00135.
-Profile PS50011 appears in the same documentation as three patterns: PS00107, PS00108 and PS00109.


Table 11: 4/72 patterns having 2 or 3 profiles in the same PROSITE documen-tation.

pattern AC  pattern ID            TP pattern  profile AC  profile ID            TP pro/TP pat

PS00435     PEROXIDASE 1          156         PS50873     PEROXIDASE 4          152/156 (97%)
PS00435     PEROXIDASE 1          156         PS50292     PEROXIDASE 3          9/156 (6%)

PS00436     PEROXIDASE 2          142         PS50873     PEROXIDASE 4          98/142 (69%)
PS00436     PEROXIDASE 2          142         PS50292     PEROXIDASE 3          0/142 (0%)

PS00383     TYR PHOSPHATASE 1     218         PS50056     TYR PHOSPHATASE 2     218/218 (100%)
PS00383     TYR PHOSPHATASE 1     218         PS50055     TYR PHOSPHATASE PTP   126/218 (58%)
PS00383     TYR PHOSPHATASE 1     218         PS50054     TYR PHOSPHATASE DUAL  52/218 (24%)

PS00678     WD REPEATS 1          1604        PS50294     WD REPEATS REGION     1599/1604 (100%)
PS00678     WD REPEATS 1          1604        PS50082     WD REPEATS 2          1382/1604 (86%)

- TP pro/TP pat: proportion of pattern TP that are matched by the profile.
- In bold: profiles that could be associated with the pattern in their respective row.
- Profiles PS50292 and PS50873 both appear in the same documentation together with two patterns: PS00435 and PS00436.
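
The TP pro/TP pat column can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function names and the sequence identifiers are hypothetical, and it assumes that each match is identified by the accession of the matched sequence, which is not necessarily how the values in the tables were obtained.

```python
from typing import Set, Tuple


def tp_coverage(pattern_tp: Set[str], profile_hits: Set[str]) -> Tuple[int, int]:
    """Count pattern true positives that the profile also matches.

    Returns (shared, total): the numerator and denominator of TP pro/TP pat.
    """
    shared = len(pattern_tp & profile_hits)
    return shared, len(pattern_tp)


def format_coverage(shared: int, total: int) -> str:
    """Render the ratio in the style used in the tables, e.g. '152/156 (97%)'."""
    percent = round(100 * shared / total) if total else 0
    return f"{shared}/{total} ({percent}%)"


# Hypothetical identifiers standing in for real sequence accessions.
pattern_tp = {f"SEQ{i:03d}" for i in range(156)}
# The profile finds 152 of them, plus one sequence the pattern does not match.
profile_hits = {f"SEQ{i:03d}" for i in range(152)} | {"OTHER001"}

print(format_coverage(*tp_coverage(pattern_tp, profile_hits)))  # 152/156 (97%)
```

Sequences matched only by the profile do not enter the ratio, since the denominator is restricted to the pattern's true positives.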


6 Acknowledgments

I would like to thank:

• Professor Amos Bairoch for having given me the opportunity to do my internship in the stimulating environment that is the SIB.

• Christian Sigrist who is the best supervisor one can hope to work for.

• Nicolas Hulo who is the best supervisor one can hope to work for.

• Tania Lima for having the kindness, and especially the courage, to correct my paper.

• Edouard De Castro, Petra Langendijk-Genevaux, Lorenzo Cerutti, Cedric Notredame and Fabrice David for helping me during the project.

• Claudia Sapsezian for welcoming me to the SIB.

• Karin Sonesson, Salvo Paesano and Alessandro Innocenti for advising me to avoid "rm *" in the shell.

• Dolnilde Dornevil, Laure Verbregue and Gwennaelle Delbard for brightening up the reception.

• Isabelle Cusin for complimenting me on my necklaces.

• Lydie Bougueleret for her valuable advice in general.

• Harris Procopiou for having refused this internship position and for teaching me how to program.

• Gregory Loichot for his use of "J'suis dechire, mec !" and for teaching me how to program.

• Nathalie Lachenal for being the best coach since Rocky’s coach.

• Anaïs Mottaz for being one of my closest friends.

• Mickey, because if I don’t mention him, he’ll be mad.


References

[1] Bairoch A. (1992) PROSITE: a dictionary of sites and patterns in proteins.Nucleic Acids Research. 20(Supplement), 2013-2018.

[2] Sigrist, C.J.A., Cerutti L., Hulo N., Gattiker A., Falquet L., Pagni M.,Bairoch A. & Bucher P. (2000). PROSITE: a documented database usingpatterns and profiles as motif descriptors. Briefings in Bioinformatics. 3(3),265-274.

[3] Bucher, P. & Bairoch, A. (1994) A generalized profile syntax for biomolec-ular sequence motifs and its function in automatic sequence interpretation.International Conference on Intelligent Systems for Molecular Biology. 2,53-61.

[4] Bairoch, A., Bucher, P. & Hofmann, K. The PROSITE database, its statusin 1995.(1996) Nucleic Acids Research. 21(1), 189-196.

[5] Wu C.H., Apweiler R., Bairoch A., Natale D.A., Barker W.C., BoeckmannB., Ferro S., Gasteiger E., Huang H., Lopez R., Magrane M., Martin M.J.,Mazumder R., O’Donovan C., Redaschi N. & Suzek B. (2006). The UniversalProtein Resource (UniProt): an expanding universe of protein information.Nucleic Acids Research. 34(Database issue), D187-91.

[6] Berman H.M., Battistuz T., Bhat T.N, Bluhm W.F., Bourne P.E.,Burkhardt K., Feng Z., Gilliland G.L., Iype L., Jain S., Fagan P., Mar-vin J.,Padilla D., Ravichandran V., Schneider B., Thanki T., Weissig H.,Westbrook J.D. & Zardecki C. (2000) The Protein Data Bank. Acta crystal-

lographica. D58, 899-907.

[7] Mulder, N.J., Apweiler , R., Attwood, T.K., Bairoch, A., Barrell, D., Bate-man, A., Binns D., Biswas, M., Bradley, P., Bork, P., Bucher, P., Copley,R.R., Courcelle, E., Das, U., Durbin, R., Falquet, L., Fleischmann, W.,Griffiths-Jones, S., Haft, D., Harte, N., Hulo, N., Kahn, K., Kanapin, A.,Krestyaninova, M., Lopez, R., Letunic, I., Lonsdale, D., Silventoinen, V.,Orchard, S.E., Pagni, M., Peyruc, D., Ponting, C.P., Selengut, J.D., Ser-vant, F., Sigrist, C.J.A, Vaughan, R. & Zdobnov, E.M. (2003). The InterProDatabase, 2003 brings increased coverage and new features. Nucleic Acids

Research. 31(1), 315-318.

[8] Mulder, N.J., Apweiler , R., Attwood, T.K., Bairoch, A., Bateman, A.,Binns D., P., Bork, P., Bulliard, V., Cerutti, L., Copley, R., Courcelle,E., Das, U., Daugherty, L., Dibley, M., Finn, R., Fleischmann, Gough,J., Haft, D., Hulo, N., Hunter, S., Kahn, D., Kanapin, A., Kejariwal, A.,Labarga, A., Langendijk-Genevaux, P.S., Lonsdale, D., Lopez, R., Letunic,I, Madera, M, Maslen, J., McAnulla, C., McDowall, J., Mistry, J., Mitchell,A., Nikolskaya, A.N., Orchard, S., Orengo, C., Petryszak, R., Selengut, J.D.,Sigrist, C.J.A, Thomas, P.D., Valentin, F., Wilson, D., Wu, C.H. & Yeats, C.

38

(2007). New developments in the InterPro database. Nucleic Acids Research.35(Database issue), D224-8.

[9] http://www.ebi.ac.uk/t-coffee/

[10] Sibbald, P.R. & Argos, P. (1990). Weighting aligned protein or nucleicacid sequences to correct for unequal representation. Journal of Molecular

Biology. 216(4), 813-818.

[11] Langendijk-Genevaux, P.S. (2003) Weighting protein sequences by a phy-logenetic tree to improve profile methods. Report of Practical Training per-

formed at Prosite, Swiss-Prot group, Swiss Institute of Bioinformatics.

[12] Gribskov, M., Luthy, R. & Eisenberg, D. (1990) Profile analysis. Methods

in Enzymolology 183, 146-159.

[13] http://www.ncbi.nlm.nih.gov/Class/NAWBIS/Modules/Similarity/simsrch13.html

[14] Luthy, R., Xenarios, I. & Bucher, P. (1994) Improving the sensitivity of thesequence profile method. Protein Science. 3(1), 139-146.

[15] http://myhits.isb-sib.ch/doc/scores.html

[16] Thompson, J.D., Higgins D.G. & Gibson T.J. (1994). CLUSTAL W: im-proving the sensitivity of progressive multiple sequence alignment throughsequence weighting, position-specific gap penalties and weight matrix choice.Nucleic Acids Research 12(22), 4673-80.

[17] Notredame, C., Higgins, D.G. & Heringa J. (2000). T-Coffee: A novelmethod for fast and accurate multiple sequence alignment. Journal of Molec-

ular Biology. 308(1), 205-17.

[18] Do, C.B., Mahabhashyam, M.S., Brudno, M. & Batzoglou S. (2005). Prob-Cons: Probabilistic consistency-based multiple sequence alignment. Genome

Research. 15(2), 330-40.

[19] Pearson, W.R. (1995) Comparison of methods for searching protein se-quence databases. Protein Science. 4(6), 1145-60.

[20] Nuin, P.A.S., Wang, Z. & Tillier, E.R.M. (2006). The accuracy of severalmultiple sequence alignment programs for proteins.BMC Bioinformatics. 7, 471.

[21] Sigrist C.J.A., De Castro, E., Langendijk-Genevaux, P.S., Le Saux, V.,Bairoch, A. & Hulo, N. (2005). ProRule: a new database containing func-tional and structural information on PROSITE profiles. Bioinformatics.21(21), 4060-6.

[22] Bulliard, V. (2004) Implementation d’une methode pour la mise a jourautomatique des patterns PROSITE, Report of Practical Training performed

at Prosite, Swiss-Prot group, Swiss Institute of Bioinformatics in Geneva.

39

[23] http://www.cathdb.info/latest/index.html

[24] http://scop.berkeley.edu/

40
