
REVIEW

Unrestricted identification of modified proteins using MS/MS

Erik Ahrné, Markus Müller* and Frédérique Lisacek

Swiss Institute of Bioinformatics, Proteome Informatics Group, Geneva, Switzerland

Received: July 13, 2009

Revised: September 21, 2009

Accepted: October 19, 2009

Proteins undergo PTM, which modulates their structure and regulates their function. Esti-

mates of the PTM occurrence vary but it is safe to assume that there is an important gap

between what is currently known and what remains to be discovered. The highest throughput

and most comprehensive efforts to catalogue protein mixtures have so far been using MS-

based shotgun proteomics. The standard approach to analyse MS/MS data is to use Peptide

Fragment Fingerprinting tools such as Sequest, MASCOT or Phenyx. These tools commonly

identify 5–30% of the spectra in an MS/MS data set while only a limited list of predefined

protein modifications can be screened. An important part of the unidentified spectra is likely

to be spectra of peptides carrying modifications not considered in the search. Bioinformatics

for PTM discovery is an active area of research. In this review we focus on software solutions

developed for unrestricted identification of modifications in MS/MS data, here referred to as

open modification search tools. We give an overview of the conceptually different algorithmic

solutions to evaluate the large number of candidate peptides per spectrum when accounting

for modifications of unrestricted size and demonstrate the value of results of large-scale open

modification search studies. Efficient and easy-to-use tools for protein modification discovery

should prove valuable in the quest for mapping the dynamics of proteomes.

Keywords:

Bioinformatics / MS/MS / Protein identification / PTM

1 Introduction

Proteins undergo PTM, which modulates their structure

and regulates their function. The identification of protein

modifications is of paramount importance to understand

the regulation and dynamics of a proteome. A range of

methodologies have been designed for the discovery of

PTMs in the past decades. The detection of a single PTM

has hinged on structural methods like X-ray or NMR and on

chemical methods involving labeling and separation tech-

niques (e.g. LC). Besides, PTM annotation in protein

sequences can also be produced with algorithms that

attempt to predict the presence of certain modifications

based on sequence patterns (see http://www.expasy.org/tools/#ptm for a comprehensive list of such tools).

Today MS is a central technology for the identification of

PTMs [1–4]. The highest throughput and most compre-

hensive efforts to catalog protein mixtures, including the

identification of PTMs, have so far been based on shotgun

proteomics [5]. For instance protein phosphorylation,

playing a major role in signaling networks, was exten-

sively mapped in large-scale MS studies [6–8]. Similarly,

the role of glycosylation as a functional modulation of

secreted or membrane proteins has been investigated using

MS/MS [9].

In a study by MacCoss et al. [10] it was estimated that

proteins on average carry three PTMs. In another paper [11] the number of modified variants in proteomic samples was predicted to be as many as 8–12 per unmodified peptide, although most of these modified species are presumed to be present at very low concentration. On the other hand, less than 1% of all proteins in UniProtKB/Swiss-Prot are annotated with a PTM [12]. The protein modification databases Unimod [13] and RESID [14] contain approximately 500 different modification entries each. Estimates of the PTM occurrence vary but it is safe to assume that there is an important gap between what is currently known and what remains to be discovered.

Abbreviations: CAD, collision-activated dissociation; ECD, electron capture dissociation; ETD, electron transfer dissociation; FDR, false discovery rate; HCD, higher energy C-trap dissociation; OMS, open modification search; PFF, peptide fragment fingerprinting; PSM, peptide spectrum match; SIMS, sequential interval motif search

*Additional corresponding author: Dr. Markus Müller
E-mail: [email protected]

Correspondence: Erik Ahrné, Swiss Institute of Bioinformatics, 1 rue Michel Servet, CH-1211 Genève 4, Switzerland
E-mail: [email protected]
Fax: +41-22-379-58-58

© 2010 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim www.proteomics-journal.com

Proteomics 2010, 10, 671–686 DOI 10.1002/pmic.200900502

The analysis of high-throughput data depends heavily on

bioinformatics. The standard approach to analyse MS/MS

data is to use a peptide fragment fingerprinting (PFF) tool

such as Sequest, MASCOT, Phenyx, X!Tandem, Sonar or

OMSSA [15–20]. These tools all have the limitation that the

user has to define potential modifications prior to the

search, and therefore often fail to identify an important

fraction of the MS/MS data set.

In this review we focus on software solutions developed

for unrestricted identification of modifications, here referred

to as open modification search (OMS) tools, where no a priori assumptions on the modification state of the sample need to be made by the user. These tools are designed to

identify already known modification types annotated in

databases as well as previously unknown post-translational

and chemically induced modifications.

We will first raise issues relating to experimental set-ups

for PTM detection as well as to conventional identification

methods. We will then detail the various OMS strategies

defined by different authors to discover PTMs in high-

throughput data. Finally we will discuss the results of some

large-scale OMS studies.

2 Background

2.1 Finding PTMs using high-throughput MS/MS

In a standard bottom-up shotgun MS/MS experiment the

protein sample is fractionated. The proteins are excised and

typically digested into peptides using a protease such as tryp-

sin. In the next step, peptides in the peptide mixture are

usually separated by reversed-phase LC coupled on line to a

mass spectrometer where the peptides are ionized and some of

them fragmented by collision-activated dissociation, CAD (or CID) MS/MS. The mass to charge ratios of possible peptide fragments (annotated as b- and y-ions, etc.), predominantly formed through backbone cleavage at the amide bond, are then calculated and matched against the experimental spectra.

In an MS/MS experiment a modified variant of a peptide can

be distinguished from the unmodified variant. For a peptide

with one modified amino acid typically 50% of the peaks in the

mass spectrum will be shifted by the m/z value of the modi-

fication compared with the spectrum of the unmodified

peptide (see Fig. 1A). The spectrum of a modified peptide may

also contain modification specific neutral loss peaks and

diagnostic ions (see Fig. 1B) [21].
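The fragment shift described above can be illustrated with a minimal sketch for the peptide of Fig. 1A. It assumes singly charged b- and y-ions only and standard monoisotopic masses; the function name `fragment_ions` and its parameters are illustrative and not taken from any tool discussed in this review.

```python
# Monoisotopic residue masses (Da) for the amino acids used in this example.
RESIDUE_MASS = {
    "A": 71.03711, "D": 115.02694, "L": 113.08406, "M": 131.04049,
    "Y": 163.06333, "V": 99.06841, "S": 87.03203, "K": 128.09496,
}
PROTON = 1.007276
WATER = 18.010565


def fragment_ions(peptide, mod_pos=None, mod_mass=0.0):
    """Return singly charged b- and y-ion m/z values as {ion name: m/z}.

    mod_pos is the 1-based position of a single modified residue (or None).
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    if mod_pos is not None:
        masses[mod_pos - 1] += mod_mass
    ions = {}
    n = len(masses)
    for k in range(1, n):
        ions[f"b{k}"] = sum(masses[:k]) + PROTON
        ions[f"y{k}"] = sum(masses[n - k:]) + WATER + PROTON
    return ions


OXIDATION = 15.994915  # monoisotopic mass of one oxygen atom
plain = fragment_ions("ADLMLYVSK")
oxidised = fragment_ions("ADLMLYVSK", mod_pos=4, mod_mass=OXIDATION)
# Ions that contain the oxidised methionine are shifted by the modification mass.
shifted = sorted(name for name in plain if abs(oxidised[name] - plain[name]) > 1e-6)
print(shifted)
print(len(shifted) / len(plain))
```

For this peptide the shifted ions are b4 to b8 and y6 to y8, that is, half of the 16 calculated fragments, in line with the 50% figure quoted above.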

2.2 Problems detecting PTMs with MS

Important limitations exist when it comes to detecting post-

translationally modified proteins using MS/MS. Many

PTMs are only present at low concentrations and the mass

spectrometer may fail to select these peptides for fragmen-

tation [22]. Some modifications are known to hamper the

enzymatic protein cleavage leading to the generation of long

and highly charged peptides and consequently spectra,

which are difficult to interpret, e.g. when glucose attaches to

K or R these tryptic cleavage sites are likely to be missed.

Furthermore some modifications induce an unexpected

fragmentation of the peptide and mass spectra that are

difficult to analyse [23, 24].

A shotgun experiment produces too much data for

manual interpretation of each spectrum. Several algorithms

have been developed to automate the analysis of the MS

output data where peptide candidates are assigned to an

experimental spectrum ranked by an empirical or statistical

peptide spectrum match (PSM) score [10–34]. An MS/MS

spectrum can typically be explained by an unmodified

peptide, a peptide modified during the sample preparation

or a peptide carrying one or more PTMs. As will be

discussed later in this review the presence of modifications

complicates the bioinformatic analysis, partly because the

number of candidates per spectrum increases dramatically

and partly because the fragmentation patterns of certain modified peptides are difficult to predict.

3 Limitations in the classic approach to MS/MS data analysis

3.1 Restricted modification searches

Commonly used restricted PFF search tools screen the

experimental MS/MS data against a user-selected protein

database. The protein sequences are digested into peptides in silico in accordance with the cleavage rules of the protease

used in the sample preparation step of the experimental

workflow. For each peptide a theoretical spectrum is gener-

ated and the similarity between an experimental spectrum

and all candidate theoretical spectra is calculated. This

approach has proven to be very successful for the identifica-

tion of unmodified peptides and their corresponding proteins.
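As an illustration of the in silico digestion step, the following sketch cleaves a protein sequence with a commonly used trypsin rule (C-terminal to K or R, except before P). The function name `tryptic_digest`, the `missed_cleavages` parameter and the example sequence are assumptions made for this example; real PFF tools implement more detailed cleavage rules.

```python
import re


def tryptic_digest(protein, missed_cleavages=0):
    """Cleave C-terminal to K or R, but not before P (a common trypsin rule)."""
    # Zero-width split points: after K/R when the next residue is not P.
    pieces = [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]
    peptides = []
    for i in range(len(pieces)):
        # Join up to `missed_cleavages` additional neighbouring pieces.
        last = min(i + missed_cleavages, len(pieces) - 1)
        for j in range(i, last + 1):
            peptides.append("".join(pieces[i:j + 1]))
    return peptides


# Hypothetical protein sequence used only for illustration.
print(tryptic_digest("MAKPLLRVSFELFADKGG"))
print(tryptic_digest("MAKPLLRVSFELFADKGG", missed_cleavages=1))
```

Each resulting peptide would then be turned into a theoretical spectrum and compared with the experimental spectra.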

Classical PFF tools can screen for a restricted list of

modifications. Before initiating the search the user is asked

to specify a list of search parameters. These parameters may

include protein sequence database, taxonomy, precursor

mass tolerance, peptide fragment mass tolerance, etc. The

user may then configure the search tool to look for a list of

known amino acid modifications, where a modification can

Figure 1. A spectrum of the non-modified peptide ADLMLYVSK (top) aligned with a spectrum of a modified variant of the same peptide (bottom), carrying a methionine oxidation (16 Da). (A) b4- to b8-ions and y6- to y8-ions in the modified spectrum, annotated with an asterisk (*), are shifted by the m/z of the modification relative to the corresponding non-modified fragment ions. (B) A spectrum of the non-modified peptide VSFELFADK (top) aligned with a spectrum of a modified variant of the same peptide (bottom), phosphorylated at serine (80 Da). Spectra of phosphorylated peptides are typically dominated by ions resulting from the neutral loss of phosphoric acid, whereas sequence-specific fragment ions formed through cleavage of the peptide backbone amide bonds are of low intensity. Spectra downloaded from http://www.peptideatlas.org/speclib/ (ISB_Hs_plasma_20070706_PUBLIC.zip and ISB_Hs-phospho_20080428.zip).


be specified as fixed or variable. A fixed modification is

assumed to be present on all instances of the residue that

carries it. Typically cysteines are carboxyamidomethylated

during the sample preparation and the reaction is close to

100% and therefore this modification should be specified as

fixed. Including fixed modifications does not increase the

complexity of the search. In contrast, variable modifications

are not necessarily present on a specific residue. In almost

all cases one would have to specify methionine oxidation as

a variable modification. When setting up the search tool to

look for variable modifications more candidate peptides will

be considered per spectrum, leading to longer search times

and possibly more false-positive identifications [35], as will

be described in more detail. In a typical LC/MS experiment

where the data is analysed with a classical PFF tool, 5–30%

of the spectra are expected to be identified [36–38]. Many

reasons could explain the failure of a software-driven inter-

pretation among which the most common are:

1 noisy spectra or spectra from impurities

2 database error (erroneous or missing sequences, incorrect annotation, etc.)

3 unexpectedly large parent mass measurement error, e.g. by detecting the wrong isotope

4 unusual enzymatic cleavage

5 modification/mutations. In fact, an important part of the unidentified spectra could be spectra of peptides carrying modifications or mutations not considered in the search.

3.2 The search space explosion

There are limitations on the number of variable modifications

that can reasonably be included when using a standard PFF

tool. If we were to look for modifications in a more or less

unsupervised way one could simply imagine including all

known modifications as variable. However, this becomes

problematic for two reasons: first, search times scale linearly

with the number of candidate peptides considered

per experimental spectrum and would become much larger.

Second, more candidates generate more random high-scoring

matches leading to a worse separation between the score

distribution of true and false-positive matches. A large overlap between these distributions means fewer identifications at a given false discovery rate (FDR) [39]. The search space

explosion limits the use of a typical PFF tool for PTM

discovery and is a major issue to be tackled when designing an

MS/MS identification algorithm that can identify protein

modifications in an unsupervised manner. Figure 2 illustrates

how allowing for more variable modifications may lead to a

loss in the total number of identified spectra when calculating

the global FDR based on a typical decoy search [40]. A

manually annotated test data set of 3269 spectra, from a

sample containing over 200 yeast proteins, acquired on a

QqTOF instrument [41] was analysed with MASCOT search-

ing a concatenated target decoy database. Six searches were

performed allowing for none to five common variable modi-

fications, and the total number of confident PSMs for each

individual search was registered at an estimated FDR of 0.05.
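The growth of the search space with the number of variable modifications can be made concrete with a small sketch. It enumerates every combination of variable modifications on a hypothetical peptide; the function name `modified_forms` and the modification lists are assumptions for illustration, and real search engines usually cap the number of modifications per peptide.

```python
from itertools import product


def modified_forms(peptide, variable_mods):
    """Enumerate every combination of variable modifications on a peptide.

    variable_mods maps a residue letter to a list of modification names.
    Each form is a tuple with one entry per residue (None = unmodified).
    """
    options = [[None] + variable_mods.get(aa, []) for aa in peptide]
    return list(product(*options))


peptide = "MSTMYK"  # hypothetical peptide
one_mod = {"M": ["Oxidation"]}
three_mods = {"M": ["Oxidation"], "S": ["Phospho"], "T": ["Phospho"],
              "Y": ["Phospho"], "K": ["Acetyl"]}
print(len(modified_forms(peptide, {})))          # unmodified form only
print(len(modified_forms(peptide, one_mod)))     # two methionines, each on or off
print(len(modified_forms(peptide, three_mods)))  # every residue can be modified
```

Here a single unmodified candidate becomes 4 candidates with one variable modification and 64 with three, which illustrates why both search time and the number of random high-scoring matches increase.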

4 Unrestricted identification of modified peptides

4.1 Software workflow approaches to PTM discovery

An extensive collection of identification tools has been

developed to perform OMS. A comprehensive selection of

these tools is presented in Table 1. As will be described in

the following sections a workflow approach is commonly

used when searching for modifications in an unrestricted

manner. Some of the software developed for this purpose is

fully integrated in MS/MS peptide identification platforms

including multiple steps of an identification workflow,

whereas other tools are isolated modules handling only a

part of the identification process.

Figure 3 shows the main three steps of a typical PTM

identification workflow. The first step includes a database

reduction, where the original database is reduced to a list of

candidate peptides or proteins that are potentially present in

the sample. Next, the spectra are matched and scored

against this database. A delicate part of this step is assigning

Figure 2. Allowing for more variable modifications may lead to a

loss in the total number of identified spectra when calculating

the global FDR based on a typical decoy search. A manually

annotated test data set of 3269 spectra, from a sample contain-

ing over 200 yeast proteins, acquired on a QqTOF instrument

was analysed with MASCOT searching a concatenated target

decoy database. Six searches were performed allowing for none

to up to five common variable modifications and the total

number of confident PSMs for each individual search was

registered at an estimated FDR of 0.05.



the modification to the correct residue. In a third

post-processing step the results are either manually or

automatically validated.

Below we review the different parts of an OMS workflow

and present the various solutions suggested by the tools

listed in Table 1. Our selection of OMS tools reflects on the

one hand tool popularity and on the other hand the extent of

conceptually different algorithmic strategies implemented

for the MS/MS data analysis.

4.2 Step 1: Database reduction

Screening a large experimental data set for modifications of

unrestricted mass against the full proteome of an organism

would in most cases be impractical in terms of both

computational time and FDR, as described above. It is

therefore meaningful to limit the search to a list of peptides

or proteins that are likely candidates for modifications. A

drastic but accurate filtering allows for more sophisticated

and computationally intensive scoring of the experimental

spectrum to the few remaining candidates. We refer to two

filtering strategies: sequence tag extraction and multiple

round processing.

4.2.1 Sequence tag extraction

Mann et al. [42] presented a filtering approach based on sequence tag extraction. Tag extraction is a database-independent peptide sequencing strategy where peptide sub-sequences are derived directly from the spectrum by linking peaks with a mass difference corresponding to the mass of an amino acid. In theory the full peptide sequence could be found in this manner, a technique known as de novo sequencing [43–49], but the accuracy is strongly limited by the quality of the data. However, extracted peptide sub-sequences of three to four amino acids, so-called sequence tags, have proven to be effective filters in order to reduce the number of candidate peptides for a spectrum [42–52]. The OMS tools Popitam [53], OpenSea [54] and MODi [55] use tag extraction to narrow down the list of candidate peptides per spectrum. To take full advantage of tag extraction filtering, the retrieval of peptides from the reference database matching the sequence tags has to be fast. InsPecT [56], a highly cited identification platform, implements an efficient tag extraction algorithm followed by a rapid trie-based scan of the database to extract the peptide candidates for a spectrum.

Table 1. OMS tools

Search tool | Filtering | Modif. specific scoring | Test data a) | Download | Citations b)
Popitam [53] | Multiple rounds c) + tags | No | Q-TOF | www.expasy.ch/tools/popitam/ | ��
MODi [55] | Multiple rounds c) + tags | No | Q-TOF | prix.uos.ac.kr/modi/ | �
InsPecT d) [56] | Multiple rounds + tags | Yes | Q-TOF, ion-trap | proteomics.ucsd.edu/Software/Inspect.html | ��
P-Mod [65] | Multiple rounds c) | No | Ion-trap, SIM data | www.mc.vanderbilt.edu/lieblerlab/p-mod.php | ��
Modiro [71] | Multiple rounds c) | Yes | Ion-trap | www.modiro.com:8080/licenseserver/home.seam | �
VEMS 3.0 [21] | Multiple rounds | Yes | Q-TOF | yass.sdu.dk | ��
ProteinProspector [72] | Multiple rounds | Yes | QSTAR, Q-TOF, ion-trap | prospector.ucsf.edu | �
Bonanza [62] | Multiple rounds c) – spectral library | No | Q-TOF | Contact authors | �
ModifiComb [37] | Multiple rounds c) – peptide DB | No | LTQ-FT | www.bmms.uu.se/Software.htm | ��
SeMoP [69] | Multiple rounds | No | Ion-trap | biomed.umit.at/upload/semop.zip | �
TwinPeaks [68] | – | No | Ion-trap | www.utoronto.ca/emililab/twinpeaks.htm | �
Interrogator [70] | – | No | QSTAR | Contact authors | ��
OpenSea [54] | Tags | No | Q-TOF, ion-trap | Contact authors | ��
SIMS [57] | Peak intervals, tags c) | No | Q-TOF, ion-trap | webprod1.ccbr.utoronto.ca | �

a) Refers to the data type the tool was tested on in the original publication.
b) Reflects the number of citations per year since the year of the original publication (�, ��, ��, ��) (0–4, 5–9, 10–14, ≥15).
c) Filtering with external tool.
d) With MS alignment.

Figure 3. A typical OMS workflow includes a database reduction step followed by enumeration and scoring of all peptide candidates. A list of post-processing algorithms has been developed to further refine the search results output.

The recently published OMS algorithm sequential interval

motif search, SIMS, filters the database based on ranked

single amino acid inter-peak intervals, ordered by the inten-

sities of the two associated peaks [57]. Every spectrum is

subjected to strict filtering, keeping only the high intensity

peaks, and converted to a symmetrical spectrum by introdu-

cing ‘‘ghost’’ peaks to generate more complete b- or y-ion

ladders. In default mode, 400 intervals are extracted per spectrum and used to retrieve the 500 most probable peptide candidates per spectrum from the reference protein database. The authors demonstrate that this filtering process can be further improved by combining it with tag extraction, coupling SIMS with the publicly available PepNovo algorithm [47].
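The principle of sequence tag extraction discussed in this section, linking peaks whose mass difference equals a residue mass, can be sketched as follows. This is a simplified illustration, not the algorithm of any of the cited tools: it assumes a noise-free, singly charged ion ladder, uses a reduced residue table (leucine stands for both L and I), and the function name `extract_tags` and the tolerance are chosen for the example.

```python
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "V": 99.06841,
    "L": 113.08406, "D": 115.02694, "K": 128.09496, "M": 131.04049,
    "F": 147.06841, "Y": 163.06333,
}


def extract_tags(peaks, tag_length=3, tol=0.02):
    """Return sequence tags read from ladders of peaks one residue apart."""
    peaks = sorted(peaks)
    # edges[i] lists (j, residue) where peak j lies one residue mass above peak i.
    edges = {i: [] for i in range(len(peaks))}
    for i, low in enumerate(peaks):
        for j in range(i + 1, len(peaks)):
            for aa, mass in RESIDUE_MASS.items():
                if abs(peaks[j] - low - mass) <= tol:
                    edges[i].append((j, aa))

    tags = set()

    def walk(i, tag):
        # Extend the tag along connected peaks until it reaches tag_length.
        if len(tag) == tag_length:
            tags.add(tag)
            return
        for j, aa in edges[i]:
            walk(j, tag + aa)

    for start in edges:
        walk(start, "")
    return tags


# Synthetic b-ion ladder of ADLMLYVSK (b1 to b8), built from the residue table.
ladder, total = [], 1.007276
for aa in "ADLMLYVSK"[:-1]:
    total += RESIDUE_MASS[aa]
    ladder.append(total)
print(sorted(extract_tags(ladder)))
```

Each extracted tag can then be used as a query against the sequence database, so that only peptides containing the tag are scored in full.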

4.2.2 Multiple round processing

Another common database filtering technique is to discard

all proteins that cannot be identified in a very sensitive,

fast and restricted PFF search [58–60]. In a first step the

data set is screened against the full database with strict

search parameters meaning at most one missed cleavage

and one or two variable modifications. A reduced database

is compiled from all proteins with at least one confidently

matched peptide in this first round search. A second

round search where PTM parameters are loosened can

then be launched. Most of the OMS tools listed in Table 1

employ this strategy, and some in combination with tag

extraction.
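A minimal sketch of the database reduction performed between the two search rounds is given below. The PSM records, protein accessions, score scale and the function name `reduce_database` are hypothetical; in practice the confidence criterion would come from the first-round search engine.

```python
def reduce_database(first_round_psms, database, score_cutoff):
    """Keep only proteins with at least one confident first-round peptide match."""
    confident = {psm["protein"] for psm in first_round_psms
                 if psm["score"] >= score_cutoff}
    return {acc: seq for acc, seq in database.items() if acc in confident}


# Hypothetical first-round results and protein database.
database = {"P1": "MAKPLLR", "P2": "VSFELFADK", "P3": "ADLMLYVSK"}
psms = [
    {"protein": "P1", "peptide": "MAKPLLR", "score": 12.0},
    {"protein": "P3", "peptide": "ADLMLYVSK", "score": 55.0},
    {"protein": "P3", "peptide": "ADLMLYVSK", "score": 61.0},
]
reduced = reduce_database(psms, database, score_cutoff=30.0)
print(sorted(reduced))
```

The second-round open modification search would then be run against `reduced` only.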

ModifiComb [61], a simple and fast OMS tool, uses an even stronger database filter and screens for modifications only on peptides confidently identified in a first round search. The underlying assumption is that most PTMs are present at sub-stoichiometric ratios. Consequently no new peptide species will be identified in the modification search. Similar approaches were used by the Bonanza algorithm [62] and presented in Ahrné et al. [63]; they take full advantage of the fact that the fragmentation patterns of many modified peptides are similar to those of the unmodified variants (see Fig. 1A). Here, a spectral library, which is a list of annotated spectra identified in a prior PFF search, is exhaustively screened for modifications.

Several of the OMS tools do not include a database-

filtering step, but it is then recommended that this is done

externally. The Swiss Protein Identification Toolbox, Swis-

sPIT [64], provides an automated solution where several

identification tools can be combined in multiple-round

workflows. The results are merged to create a reduced but

comprehensive protein database, which is passed on to a

second identification step where it is explored for modifi-

cations with the OMS tools Popitam [53] and InsPecT [56].

Efficient filtering will speed up the OMS search while

boosting its discriminatory power. It is important to point

out that filtering criteria largely influence the outcome of the

modification search. When multiple round processing is

used, the overall results of the OMS workflow will depend

on the quality of the reduced size protein database. An

important question that arises, but which is given little

attention in the literature, is what selection criteria to apply

on the proteins included in this database. The identification

of modified peptides may increase the sequence coverage of

individual proteins, thus validating new proteins and resol-

ving ambiguities where single peptides map to multiple

protein entries. Using strict protein selection criteria means

ignoring modifications that may occur on certain proteins

while limiting protein validation based on the discovery of

modified peptides. In contrast, loose selection criteria may

lead to an increased number of false-positive PSMs and

longer search times. How this trade-off should be dealt with

remains to be investigated.

4.2.3 Enumerating the candidate peptides

For most OMS tools the mass range of the modifications to

be included in the search is a user-defined variable. Typically

the mass range of a modification search is set between −100

and 300 Da. For each experimental spectrum all peptides in

the database within the given modification mass range are

evaluated. It is assumed that the difference in precursor

mass between the query spectrum and the unmodified

candidate peptide corresponds to the mass of one or more

modifications. However, allowing for more than one modi-

fication per peptide is not recommended as many spectra

can find a high-scoring match by chance if multiple

unrestricted modifications are included. An OMS in its

simplest form enumerates all modification scenarios given a

peptide sequence and a modification mass, where a modi-

fication scenario corresponds to every possible location of

the modification along the peptide sequence. Next, a theo-

retical spectrum is generated for each modified peptide

candidate and scored against the experimental spectrum

(see Fig. 4). This exhaustive approach is slow and the time

complexity for a single modification is quadratic in the

length of the peptide. Some tools apply empirical rules in

order to restrict the number of considered scenarios. For

example, P-mod [65] does not consider scenarios where the

absolute value of a negative mass shift is greater than the

amino acid side chain mass at a specific sequence location.

MS-Alignment [66], the OMS algorithm of the InsPecT



platform, speeds up the search for the optimal attachment

position of a PTM using dynamic programming (imple-

mented in linear time), based on an improved version of the

spectral alignment algorithm introduced by Pevzner et al. [67].
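The exhaustive enumeration described in this section can be sketched as follows. This is an illustrative toy, not the implementation of any cited tool: it assumes a single modification, singly charged b- and y-ions, a shared peak count as score, and hypothetical names (`by_ions`, `shared_peaks`, `open_search`). Each of the n candidate positions requires generating and matching O(n) fragment ions, which is the source of the quadratic cost.

```python
RESIDUE_MASS = {
    "A": 71.03711, "D": 115.02694, "L": 113.08406, "M": 131.04049,
    "Y": 163.06333, "V": 99.06841, "S": 87.03203, "K": 128.09496,
}
PROTON, WATER = 1.007276, 18.010565


def by_ions(peptide, mod_pos=None, mod_mass=0.0):
    """Singly charged b- and y-ion m/z values; mod_pos is 1-based."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    if mod_pos is not None:
        masses[mod_pos - 1] += mod_mass
    total, prefix, ions = sum(masses), 0.0, []
    for m in masses[:-1]:
        prefix += m
        ions.append(prefix + PROTON)                  # b-ion
        ions.append(total - prefix + WATER + PROTON)  # complementary y-ion
    return ions


def shared_peaks(theoretical, experimental, tol=0.02):
    """Number of theoretical peaks with an experimental peak within tol."""
    return sum(any(abs(t - e) <= tol for e in experimental) for t in theoretical)


def open_search(peptide, precursor_mass, experimental, mass_range=(-100.0, 300.0)):
    """Place the precursor mass difference at every residue; return the best
    (position, delta mass, score), or None if delta is outside mass_range."""
    delta = precursor_mass - (sum(RESIDUE_MASS[aa] for aa in peptide) + WATER)
    if not mass_range[0] <= delta <= mass_range[1]:
        return None
    best = None
    for pos in range(1, len(peptide) + 1):  # one scenario per position
        score = shared_peaks(by_ions(peptide, pos, delta), experimental)
        if best is None or score > best[2]:
            best = (pos, delta, score)
    return best


# Synthetic "experimental" spectrum: ADLMLYVSK oxidised at methionine 4.
OXIDATION = 15.994915
spectrum = by_ions("ADLMLYVSK", 4, OXIDATION)
precursor = sum(RESIDUE_MASS[aa] for aa in "ADLMLYVSK") + WATER + OXIDATION
result = open_search("ADLMLYVSK", precursor, spectrum)
print(result)
```

On this noise-free spectrum only the scenario with the mass shift on the methionine explains all 16 fragment ions; with real data several positions often score equally, which is the positional ambiguity discussed later in this section.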

When tag extraction filtering has been used the number

of scenarios can be limited based on the extracted sequence

tags. Popitam [53] uses sequence tags and full candidate

peptide sequences to construct a spectrum graph where a

path represents a possible modification or mutation

scenario. Each path of the graph is evaluated in order to find

the optimal peptide candidate. Another approach employed

by OpenSea [54] and MODi [55] aligns multiple sequence

tags extracted from a single spectrum and regions that

cannot be matched to a database sequence are assumed to

be either amino acid substitutions or modifications.

TwinPeaks [68] and the recently published search tool

SeMoP, Search for Modified Peptides, [69] use a concep-

tually different algorithm where constant mass shifts

between the peaks of an experimental and a theoretical

spectrum are sought for, which would indicate that the

candidate peptide is modified at a given position.
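The idea of looking for a constant mass shift between experimental and theoretical peaks can be illustrated with a histogram of pairwise differences. The sketch below is a simplified interpretation and not the published TwinPeaks or SeMoP algorithm; the function names, the 0.01 Da binning by rounding and the `min_shift` cut-off are assumptions.

```python
from collections import Counter

RESIDUE_MASS = {
    "A": 71.03711, "D": 115.02694, "L": 113.08406, "M": 131.04049,
    "Y": 163.06333, "V": 99.06841, "S": 87.03203, "K": 128.09496,
}
PROTON, WATER = 1.007276, 18.010565


def by_ions(peptide, mod_pos=None, mod_mass=0.0):
    """Singly charged b- and y-ion m/z values; mod_pos is 1-based."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    if mod_pos is not None:
        masses[mod_pos - 1] += mod_mass
    total, prefix, ions = sum(masses), 0.0, []
    for m in masses[:-1]:
        prefix += m
        ions.append(prefix + PROTON)
        ions.append(total - prefix + WATER + PROTON)
    return ions


def dominant_shift(experimental, theoretical, min_shift=0.5, decimals=2):
    """Most frequent non-zero experimental-minus-theoretical peak difference.

    Differences are binned by rounding; values near a bin edge may be split
    across two bins, which a real implementation would need to handle.
    """
    counts = Counter()
    for e in experimental:
        for t in theoretical:
            diff = round(e - t, decimals)
            if abs(diff) >= min_shift:
                counts[diff] += 1
    return counts.most_common(1)[0] if counts else None


theoretical = by_ions("ADLMLYVSK")                   # unmodified candidate
experimental = by_ions("ADLMLYVSK", 4, 15.994915)    # synthetic oxidised spectrum
shift, count = dominant_shift(experimental, theoretical)
print(shift, count)
```

The most frequent shift corresponds to the modification mass, and the set of fragment ions that carry it indicates the region of the peptide where the modification sits.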

The assignment of a modification to a specific amino acid

residue is prone to errors. It is common that a spectrum

does not contain enough information to pinpoint the posi-

tion of a modification and the positional assignment leads to

so-called delta correct identifications: correct peptide and

correct modification mass but erroneous site. An early OMS

tool, Interrogator [70], designed for fast processing by

effectively indexing a sequence database, focused on

assigning a modification to a region of the peptide sequence

rather than a specific amino acid. Labile modifications are in

general difficult to position without further empirical data

since the resulting spectra often contain few shifted peaks

relative to the unmodified peptide spectrum. Examples are

O-glycosylation, sulphation and phosphorylation that are

commonly eliminated as neutral losses during fragmenta-

tion.

4.3 Step 2: Matching

4.3.1 Similarity scoring

A number of scoring algorithms have been developed in

order to determine the similarity between an experimental

spectrum and a theoretical spectrum. In its simplest form

the theoretical spectrum contains the calculated b- and y-ion

Figure 4. An illustration of a simple exhaustive OMS search. The list of peptides within the specified modification mass tolerance, typically [−150, 300] Da, is extracted (I). A tag extraction step will further narrow down the number of candidate peptides per spectrum. An OMS in its simplest form enumerates all modification scenarios given a peptide sequence and a modification mass, where a modification scenario corresponds to every possible location of the modification along the peptide sequence (II). Next, a theoretical spectrum is generated for each modified peptide candidate and scored against the experimental spectrum (III).


fragments and the similarity score is based on the shared

peak count between the compared spectra [20, 60]. In

contrast, some scoring schemes include a multitude of fragment types, including a-ions, x-ions and internal fragments, and extract several features in addition to the shared

peak count such as the ratio of experimental to theoretical b-

and y-ions, the length of continuous ion-series, matching

peak intensity, etc. A score based on the combined measure

of these features is then derived [15, 17].

Spectral library search tools typically use scoring schemes

taking into account the intensity of the spectrum peaks.

Bonanza [62] uses a dot product-based scoring, which is the

normalised scalar product of the compared spectra repre-

sented as multidimensional vectors.
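The two simplest similarity measures mentioned above, the shared peak count and the normalised dot product, can be sketched as follows. The 1 Da binning, the function names and the example peak lists are assumptions for illustration; the exact binning and intensity transformations used by Bonanza or other tools are not reproduced here.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, bin_width=1.0):
    """Map (m/z, intensity) peaks to {bin index: summed intensity}."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz / bin_width)] += intensity
    return binned


def shared_peak_count(a, b, bin_width=1.0):
    """Number of m/z bins occupied in both spectra."""
    return len(set(bin_spectrum(a, bin_width)) & set(bin_spectrum(b, bin_width)))


def normalised_dot_product(a, b, bin_width=1.0):
    """Cosine of the angle between the two binned intensity vectors."""
    va, vb = bin_spectrum(a, bin_width), bin_spectrum(b, bin_width)
    dot = sum(va[k] * vb[k] for k in va.keys() & vb.keys())
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0


# Hypothetical peak lists as (m/z, intensity) pairs.
query = [(100.1, 10.0), (200.2, 5.0)]
library = [(100.4, 8.0), (300.3, 2.0)]
print(shared_peak_count(query, library))
print(normalised_dot_product(query, library))
```

The shared peak count ignores intensities entirely, whereas the dot product rewards agreement between the most intense peaks, which is why it is typically preferred for spectral library searches.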

4.3.2 Modification specific scoring

As mentioned earlier, simply assuming that a modification

induces a shift of fragment masses is not a valid model

for all types of modifications. Although a lot remains to be

understood when it comes to peptide fragmentation

patterns, more sophisticated scoring schemes should be

defined, especially when considering that different peptides

may have the same, or very similar, theoretical b- and y-ion series, e.g. the mass of a methylated aspartic acid equals the mass of glutamic acid. Some software tools

include modification-specific scoring that takes into

account fragment types associated with a particular modi-

fication. The commercially distributed OMS tool Modiro

(formerly PTM-Explorer) (Protagen AG) [71] uses predefined

search strategies to look for some of the common modifi-

cations annotated in Unimod [13]. A specialised phosphor-

ylation scoring considers the presence of the usually

observed neutral loss signals of the phosphate group in

the fragmentation spectrum. VEMS 3.0 [21] is another

identification algorithm that considers an extensive list

of PTM-specific neutral losses and diagnostic fragment

ions, designed to distinguish between near-isobaric modifications such as lysine acetylation and lysine tri-methylation. Lysine acetylation exhibits a diagnostic ion at m/z 126.0913 whereas the spectrum of a lysine tri-methylated peptide commonly contains a neutral loss peak at m/z 59.0735. Other tools like ProteinProspector [72] can be

configured to look for unknown modifications while

targeting labile modifications. InsPecT also has a sophisti-

cated scoring algorithm accounting for the fragmentation

probabilities of different instrument types and the effects of

certain PTMs.
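The isobaric and near-isobaric cases mentioned in this section can be checked with a few lines of arithmetic. The sketch below uses standard monoisotopic masses; the 1500 Da peptide mass and the helper name `ppm_needed` are assumptions chosen for illustration.

```python
# Monoisotopic masses (Da).
ASP, GLU = 115.02694, 129.04259   # residue masses of aspartic and glutamic acid
METHYL = 14.01565                 # CH2 added by methylation
ACETYL = 42.010565                # C2H2O added by acetylation
TRIMETHYL = 42.046950             # C3H6 added by tri-methylation


def ppm_needed(mass_a, mass_b, peptide_mass):
    """Mass accuracy (ppm) required to separate two shifts on a peptide."""
    return abs(mass_a - mass_b) / peptide_mass * 1e6


# Methylated aspartic acid and glutamic acid have the same elemental composition.
print(abs(ASP + METHYL - GLU) < 1e-4)
# Acetylation and tri-methylation differ by about 0.036 Da.
print(round(TRIMETHYL - ACETYL, 6))
# Accuracy needed to tell them apart on a hypothetical 1500 Da peptide.
print(round(ppm_needed(ACETYL, TRIMETHYL, 1500.0), 1))
```

When the precursor mass accuracy is not sufficient to resolve such a difference, modification-specific fragments such as the diagnostic ions and neutral losses described above become the main source of discrimination.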

4.4 Step 3: Post-processing

Most identification tools attempt to assign a statistical

quality measure to a PSM such as a p-value/e-value or FDR.

These measures are commonly estimated by screening the

experimental data set against a database of randomised

peptide sequences. The performance of a tool can be evaluated based on the trade-off between error rate and sensitivity, often visualised in a receiver operating characteristic (ROC) curve [17].

Despite efforts to reduce the number of candidate peptides

in the database and the development of improved scoring

algorithms, high-scoring random matches remain a

problem when looking for PTMs in an unsupervised

manner. Strict error rate thresholds often lead to important

loss of sensitivity and the opposite will return disputable

matches that have to be manually validated.
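The decoy-based error estimation mentioned above can be sketched as follows. The example assumes a concatenated target-decoy search and estimates the FDR at a score cut-off as the ratio of decoy to target PSMs above it, which is one of several estimators in use; the function name `count_at_fdr` and the score lists are hypothetical.

```python
def count_at_fdr(psms, max_fdr=0.05):
    """Number of target PSMs accepted at the loosest score cut-off meeting max_fdr.

    psms is a list of (score, is_decoy) pairs from a concatenated target-decoy
    search; the FDR at a cut-off is estimated as decoys / targets above it.
    """
    targets = decoys = best = 0
    for score, is_decoy in sorted(psms, key=lambda p: p[0], reverse=True):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= max_fdr:
            best = targets
    return best


# Hypothetical search result: 50 target PSMs and 4 decoy PSMs.
psms = ([(100.0 - i, False) for i in range(40)]
        + [(59.5, True)]
        + [(59.0 - i, False) for i in range(10)]
        + [(40.0, True), (39.0, True), (38.0, True), (37.0, True)])
print(count_at_fdr(psms, max_fdr=0.05))
print(count_at_fdr(psms, max_fdr=0.0))
```

With more random high-scoring matches, as in an open modification search, decoy hits appear at higher scores and fewer target PSMs pass the same FDR threshold.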

A range of post-processing algorithms has been devel-

oped to further refine the search results. In addition to

presenting a sophisticated alignment algorithm to compare

theoretical and experimental spectra Tsur et al. [66] proposed

a new way to tackle loss of discriminatory power in open

modification searches. Their approach relies on the tabula-

tion of all the mass shifts reported by the software for each

amino acid in a large data set. Assuming that incorrect

modification mass assignments will distribute randomly

across all amino acids, those matches containing modifica-

tions of residues reported multiple times are more likely to

represent true modified peptides.
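
The tabulation idea can be sketched as follows, assuming that each PSM is reduced to a peptide sequence, the 0-based position of the modified residue and the reported mass shift. This is an illustration of the principle only; the bin width, the count threshold and all names are assumptions rather than details from Tsur et al. [66].

```python
from collections import Counter


def tabulate_mass_shifts(psms, bin_width=1.0):
    """Count how often each (residue, binned mass shift) pair is reported.

    `psms` is a list of (peptide, position, mass_shift) tuples, where
    `position` is the 0-based index of the modified residue.
    """
    counts = Counter()
    for peptide, position, mass_shift in psms:
        counts[(peptide[position], round(mass_shift / bin_width))] += 1
    return counts


def recurrent_psms(psms, min_count=2, bin_width=1.0):
    """Keep only PSMs whose (residue, mass shift) pair recurs in the data set."""
    counts = tabulate_mass_shifts(psms, bin_width)
    return [
        (peptide, position, mass_shift)
        for peptide, position, mass_shift in psms
        if counts[(peptide[position], round(mass_shift / bin_width))] >= min_count
    ]
```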

PTMFinder [73] elaborates on the same idea of studying

the global evidence for modifications found in the experi-

mental data. Here the focus is shifted from the significance

of individual PSMs to modification site scoring. The authors

acknowledge that open modification searches are error

prone, but try to make use of the fact that large data sets

tend to contain a lot of redundant and complementary

spectra. The post-processing tool is designed to be a plug-

gable module that can handle the output of any OMS tool,

although demonstrated in combination with MS-Alignment

and the InsPecT scoring. The basic idea is to group the

identification results by modification site and extract the

evidence for each occurrence. Features such as the number

of overlapping peptides carrying the same modification and

the same modified peptide found in multiple charge states

are used to train a Support Vector Machine used to distin-

guish between false and correct modification site assign-

ments.
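
A minimal sketch of the grouping step is given below. It assumes that each PSM has already been mapped to a protein and a residue number, and it computes only the kind of evidence features mentioned above; the classifier training itself is not shown. The tuple layout and feature names are hypothetical.

```python
from collections import defaultdict


def site_features(psms):
    """Group PSMs by modification site and summarise the evidence per site.

    `psms` is a list of (protein, site, mod_mass, peptide, charge) tuples
    (assumed layout). Sites are keyed by protein, residue number and the
    modification mass rounded to the nearest integer.
    """
    groups = defaultdict(list)
    for protein, site, mod_mass, peptide, charge in psms:
        groups[(protein, site, round(mod_mass))].append((peptide, charge))

    features = {}
    for key, hits in groups.items():
        features[key] = {
            "n_spectra": len(hits),
            "n_distinct_peptides": len({peptide for peptide, _ in hits}),
            "n_charge_states": len({charge for _, charge in hits}),
        }
    return features
```

Feature vectors of this kind could then be passed to any supervised classifier, provided a set of trusted and decoy site assignments is available for training.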

ComByne [74] is another interesting post-processing

module scoring modification sites. In a first step, peptide

match probabilities are adjusted for peptide length, missed

cleavages and modifications. The rationale here is that

chances of randomly matching a semi-tryptic, modified, and

unmodified peptide with an elevated score differ, since, e.g., the database may contain substantially more modified

peptide candidates than unmodified peptides. Furthermore,

short peptides may randomly be assigned high scores solely

based on correct prefix or suffix amino acids. Similar

reasoning is used by other post-processing tools such as

PeptideProphet [32] and Panoramics [75]. A novelty in

ComByne is that it refines the p-value of a peptide match

based on the difference in measured retention time and the

predicted retention time of the peptide candidate. In addi-



tion, p-values are adjusted based on corroboration, where in

analogy with PTMFinder the global evidence of a modifi-

cation site is accounted for. The discovery of overlapping

peptides would boost the p-value of both peptides. Similarly,

when the unmodified and modified variants of the same

peptide species are found the p-values are recalculated.

ComByne can also score phosphorylation sites. An identification of

ASLGS[-18]LEGEASSPK becomes more believable if another spectrum

matches ASLGS[+80]LEGEASSPK, because phosphorylated serine has a common

neutral loss of 98 Da.
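
The corroboration between a -18 Da and a +80 Da assignment on the same residue can be sketched as follows. The code only detects such pairs; how the p-values are then recalculated is not specified here and is left out. The input layout and the tolerance are assumptions.

```python
def find_neutral_loss_pairs(psms, mod_mass=80.0, loss_mass=98.0, tol=0.5):
    """Return (peptide, position) pairs reported both with the intact
    modification (+mod_mass) and with its neutral-loss form.

    `psms` is a list of (peptide, position, mass_shift) tuples. For
    phosphoserine the neutral-loss form appears as 80 - 98 = -18 Da.
    """
    residual = mod_mass - loss_mass
    intact = {
        (peptide, position)
        for peptide, position, shift in psms
        if abs(shift - mod_mass) <= tol
    }
    return [
        (peptide, position)
        for peptide, position, shift in psms
        if abs(shift - residual) <= tol and (peptide, position) in intact
    ]
```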

5 Improving modification discovery

5.1 Mass-accuracy and modification discovery

The mass spectrometer type used to analyse samples

dramatically influences the number of candidate peptides

per spectrum in a restricted PFF search. An experimental

spectrum with a precursor mass of 1000.48 Da has

approximately 90 times more unmodified peptide candi-

dates in UniProtKB/Swiss-Prot (Yeast) if produced on a low

mass accuracy ion-trap instrument (precursor mass accuracy

±2 Da) compared with an FT-ICR (precursor mass

accuracy ±0.006 Da) (see Table 2). Exact precursor mass

measurements naturally speed up the bioinformatic part of

an LC-MS workflow, when analysing the data with a clas-

sical PFF tool. However, substantially higher precursor

mass precision does not by default lead to a dramatic

increase in the number of identified peptides and proteins.

Haas et al. [76] investigated the benefits of high-mass

accuracy measurements analysing different data sets with

an LTQ-FT instrument where the FT-ICR part of the

instrument was either not exploited or used for MS scan

survey. The MS/MS data of a complex peptide mixture

derived from the yeast proteome was submitted to

SEQUEST [15] with search parameters adapted to the

instrumentation. Interestingly, despite a dramatic search

space reduction only 10% more peptide identifications were

produced when the FT was turned on. The advantage of the

FT-ICR was proved important mainly for assigning MS/MS

spectra with low signal-to-noise ratios. SEQUEST returned

100% more confident peptide matches in the high mass

accuracy data when analysing a yeast sample enriched for

phosphorylated peptides. These spectra are often dominated

by ions resulting from the neutral loss of phosphoric acid (as

seen in Fig. 1B), whereas sequence-specific fragment ions

formed through cleavage of the peptide backbone amide

bonds can be of low intensity. In an OMS, higher precursor

mass accuracy does not reduce the number of candidate

peptides per spectrum. Consequently search times are not

reduced but more discriminative grouping of PSMs with the

same modification mass can be obtained and modification

masses can be accurately mapped to known modifications

annotated in databases. To the best of our knowledge no

study has been published investigating the benefits of high

precursor mass accuracy measurements and OMS.
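
The effect of precursor tolerance on the number of candidates can be illustrated with a small sketch. It assumes a sorted list of theoretical peptide masses and treats an OMS as a restricted search whose window is widened by the modification mass tolerance. The masses used in the usage note are toy values, not the yeast database of Table 2.

```python
from bisect import bisect_left, bisect_right


def count_candidates(sorted_masses, precursor_mass, precursor_tol, mod_tol=0.0):
    """Count peptides whose unmodified mass could explain the precursor.

    The accepted window is precursor_mass +/- (mod_tol + precursor_tol);
    mod_tol = 0 corresponds to a search without modifications.
    `sorted_masses` must be sorted in ascending order.
    """
    low = precursor_mass - mod_tol - precursor_tol
    high = precursor_mass + mod_tol + precursor_tol
    return bisect_right(sorted_masses, high) - bisect_left(sorted_masses, low)
```

With the toy list `[905.0, 998.6, 999.9, 1000.478, 1000.482, 1001.9, 1099.0, 1200.0]` and a 1000.48 Da precursor, tightening the precursor tolerance from 2 Da to 0.006 Da reduces the restricted candidates from five to two, whereas with a 100 Da modification tolerance both settings return seven candidates.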

High mass accuracy measurements of fragment ions have

been shown to be of great importance for peptide identifica-

tion based on de novo sequencing [49, 77]. In a recent publi-

cation [78], the benefits of fragment ions acquired at high

mass accuracy, using an LTQ-Orbitrap, were further investigated.

Data acquired on an Orbitrap and on a linear ion trap,

from the same Pseudomonas aeruginosa sample, were analysed

in restricted searches with Phenyx [33] (Genebio SA), setting

the fragment mass tolerance to 12 ppm and 0.5 Da, respec-

tively. As the overall search space was increased, by allowing

for multiple missed tryptic cleavage sites and a variable

(methyl-ester) modification on amino acids D, E, S and T,

high accuracy fragment mass measurements resulted in

more identified spectra, at low FDRs. The spectra not confi-

dently identified in the Phenyx searches were extracted and

submitted to the OMS tool Popitam [53] searching the

Orbitrap data with a fragment mass tolerance of 0.01 Da

(minimum accepted by the software tool) and 0.5 Da for the

ion-trap data. While no additional confident identifications

were found in the ion-trap data set, peptides with mass shifts

corresponding to oxidation, adduction of sodium, methyla-

tion and dethiomethylation were frequently observed in the

high fragment mass accuracy Orbitrap data.
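
To relate the relative (ppm) and absolute (Da) fragment tolerances quoted above, a short sketch is given below. The function names are illustrative; the conversion itself is the standard definition of parts per million.

```python
def ppm_to_da(mz, ppm):
    """Convert a relative tolerance in ppm to an absolute window in Da at `mz`."""
    return mz * ppm / 1e6


def within_tolerance(observed_mz, theoretical_mz, tol, unit="ppm"):
    """Check whether an observed peak matches a theoretical fragment m/z.

    `unit` is either "ppm" (relative tolerance) or "da" (absolute tolerance).
    """
    window = ppm_to_da(theoretical_mz, tol) if unit == "ppm" else tol
    return abs(observed_mz - theoretical_mz) <= window
```

At m/z 500, a 12 ppm tolerance corresponds to a window of 0.006 Da, roughly 80 times narrower than the 0.5 Da used for the ion-trap data, which gives an idea of how many fewer random fragment matches can be expected.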

Fortunately, the use of high mass accuracy instruments

is becoming more and more common in proteomics labs. For

an in-depth review of mass accuracy in proteomics

experiments we recommend [79].

5.2 Extended identification workflows

A number of clever data pre-processing steps have been

proposed. They are worth considering for reducing

computational time and possibly increasing PTM identifi-

cation rate. MS/MS experiments often generate redundant

data sets containing multiple spectra of the same peptides.

On this basis, a fast clustering algorithm was presented,

Table 2. A comparison of the number of precursors considered for three types of searches a)

Instrument     Unmodified   Five variable modifications b)   OMS c)
Ion-trap d)    442          2838                             232 216
FT e)          10           110                              232 216

a) Listing the number of fully tryptic candidate peptides, in the UniProt database (Yeast, 6594 proteins), for a 1000.48 Da precursor, produced on an ion-trap and a high mass accuracy LTQ-FT instrument, respectively, and analysed in three types of searches.

b) Deamidation (N, Q), Methylation (H, K), Oxidation (M, W), Acetylation (L), Sodium adduct (D, E).

c) ±100 Da modification mass tolerance.

d) ±2 Da precursor mass tolerance.

e) ±0.006 Da precursor mass tolerance.

Proteomics 2010, 10, 671–686 679


grouping similar spectra and replacing them with a single

representative spectrum [80]. The authors demonstrated that data

sets of over ten million spectra could be reduced by a factor

of ten, significantly speeding up the subsequent database

search. This of course is especially meaningful when using

OMS tools as the analysis time per spectrum is much larger

than when performing a restricted search with a classical

PFF tool.
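
The principle of replacing redundant spectra by a representative can be sketched with a simple greedy scheme based on the cosine similarity of binned spectra. This is not the algorithm of [80]; the binning, the similarity threshold and the choice of the first member as representative are simplifying assumptions.

```python
import math
from collections import Counter


def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins."""
    binned = Counter()
    for mz, intensity in peaks:
        binned[int(mz / bin_width)] += intensity
    return binned


def cosine(a, b):
    """Cosine similarity between two binned spectra."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def greedy_cluster(spectra, min_similarity=0.8, precursor_tol=2.0):
    """Assign each spectrum to the first cluster whose representative is similar.

    `spectra` is a list of (precursor_mz, peaks) pairs. Returns a list of
    clusters, each a list of indices into `spectra`; the first member of a
    cluster acts as its representative.
    """
    clusters = []  # entries: (precursor_mz, binned representative, member indices)
    for index, (precursor_mz, peaks) in enumerate(spectra):
        binned = bin_spectrum(peaks)
        for rep_mz, rep_binned, members in clusters:
            close = abs(rep_mz - precursor_mz) <= precursor_tol
            if close and cosine(rep_binned, binned) >= min_similarity:
                members.append(index)
                break
        else:
            clusters.append((precursor_mz, binned, [index]))
    return [members for _, _, members in clusters]
```

Only one spectrum per cluster would then be submitted to the search, and the identification propagated to the other members.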

Bern et al. [81] describe a spectrum quality assessment

tool and show how it may be of particular interest in a pre-

processing step preceding a modification search: spectra

assigned a high quality but not identified in a restricted

search can often be explained by modified peptides,

although many modifications produce low-quality spectra.

We strongly recommend reading Tanner et al. [82],

which provides a well-written user manual of the InsPecT software

platform. The authors suggest that an OMS could be

followed by a restrictive follow-up search. Here the data are

further explored for some of the more frequent or especially

interesting modifications listed in the OMS output. The

restrictive search can be more sensitive in detecting modi-

fications with known effects on fragmentation and multiple

modifications per peptide are allowed. SeMoP [69] automates

such a three-step strategy. First, a standard database search

is performed with SEQUEST [15]. Second, all peptides

corresponding to the identified proteins or only the identi-

fied peptides from step one are exhaustively explored for

modifications. Finally, data are re-submitted to SEQUEST

for a targeted search for specific modifications found in the

unrestricted search, allowing for multiple modifications per peptide.
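
Allowing multiple modifications per peptide in the targeted follow-up search amounts to enumerating combinations of modifiable residues. A minimal sketch, assuming a dictionary that maps residues to the mass shifts selected from the unrestricted search (the function name and layout are hypothetical):

```python
from itertools import combinations


def modified_forms(peptide, variable_mods, max_mods=2):
    """Yield (positions, total_mass_shift) for every way of placing up to
    `max_mods` variable modifications on `peptide`.

    `variable_mods` maps a residue letter to its mass shift in Da; each
    residue can carry at most one modification in this sketch.
    """
    sites = [
        (index, variable_mods[residue])
        for index, residue in enumerate(peptide)
        if residue in variable_mods
    ]
    for n_mods in range(max_mods + 1):
        for combo in combinations(sites, n_mods):
            positions = tuple(index for index, _ in combo)
            yield positions, sum(shift for _, shift in combo)
```

The number of forms grows combinatorially with the number of modifiable residues, which is why such a search is only practical for a short list of modifications.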

5.3 Experimental set-up for PTM identification

Studying the results presented in the papers describing the

OMS tools discussed above it becomes clear that the vast

majority of the modifications identified are in fact not post-

translational but rather chemical modifications induced by

sample preparation such as Cysteine Carbox-

yamidomethylation, N-terminus and Lysine Carbamylation,

Oxidation of Methionine and Sodium and Potassium

adducts. PTMs, especially those previously unknown, can be

expected to be of low abundance. This is illustrated in Fig. 5

showing the modification mass distribution of a typical

OMS search. The histogram displays the confident PSMs

returned when screening a human blood plasma sample

data set produced on an Orbitrap instrument, and analysed

with a novel library search-based OMS tool, QuickMod

(Ahrne et al. manuscript in preparation).

The detection of low-abundance PTM peptides is very

limited when analysing complex mixtures because these

peptides are overshadowed by unmodified peptides and

peptides modified during the sample preparation. In addi-

tion, post-processing algorithms tend to favour the identifi-

cation of abundant modifications for which extensive

evidence can be found. These factors complicate the

successful discovery of rare PTMs. The detection problem

can be partly improved by various promising sample

preparation techniques such as anti-phosphoamino acid

antibodies for protein isolation [83, 7] and affinity-based

enrichment of modified proteins or peptides [6, 84, 85].

Seo et al. present a protocol targeting low-abundance

PTMs [86]. The authors describe a clever LC-MS workflow,

Selectively Excluded Mass Screening Analysis (SEMSA)

where samples are analysed by an LC-ESI-qTOF in multiple

rounds. For each round, the precursor masses of spectra

confidently identified, using MODi [55], are added to a mass

exclusion list allowing for the fragmentation of precursor

ions of low intensities. A similar set-up, presented in

Schmidt et al. [87], where unidentified MS features were

added to an inclusion list for targeted fragmentation, led

to extensive identification of phosphorylation sites in a

protein mixture obtained from Drosophila melanogaster lysates.
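
The logic of an iteratively growing exclusion list can be sketched as a small simulation. The `identify` callback stands in for the database search of each round and is purely hypothetical, as are the tolerance and the number of precursors selected per round.

```python
def select_precursors(precursors, exclusion_list, top_n=3, tol=0.01):
    """Pick the `top_n` most intense precursors not on the exclusion list.

    `precursors` is a list of (mass, intensity) pairs; a precursor is
    excluded if its mass lies within `tol` of any excluded mass.
    """
    def excluded(mass):
        return any(abs(mass - other) <= tol for other in exclusion_list)

    eligible = [p for p in precursors if not excluded(p[0])]
    return sorted(eligible, key=lambda p: p[1], reverse=True)[:top_n]


def run_rounds(precursors, identify, n_rounds=3, top_n=3):
    """Simulate several acquisition rounds and return, per round, the list
    of precursor masses selected for fragmentation.

    `identify(mass)` returns True when the spectrum of that precursor was
    confidently identified; such masses are excluded in later rounds.
    """
    exclusion_list = []
    fragmented_per_round = []
    for _ in range(n_rounds):
        selected = select_precursors(precursors, exclusion_list, top_n)
        fragmented_per_round.append([mass for mass, _ in selected])
        exclusion_list.extend(mass for mass, _ in selected if identify(mass))
    return fragmented_per_round
```

In each round the instrument time freed by the excluded precursors becomes available for ions of lower intensity.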

Another interesting LC-MS identification workflow was

recently published by Carapito et al. [88] where spectra are

acquired under different collision conditions and a peptide

mass inclusion list is compiled based on the detection of

modification specific neutral loss fragments and reporter

ions in combination with ion signals corresponding to the

modified and unmodified peptide masses. In a final step the

peptides on the mass inclusion list are sequenced in a

directed MS/MS mode.

Figure 5. The vast majority of the modifications identified in a

typical OMS are in fact not post-translational but rather modifi-

cations induced by the sample preparation such as Cysteine

Carboxyamidomethylation, N-terminus and Lysine Carbamyla-

tion, Oxidation of Methionine and Sodium and Potassium

adducts. PTMs, especially those previously unknown, can be

expected to be of low abundance. The histogram displays the

confident modified PSMs returned when screening a human

blood plasma sample data set produced on an Orbitrap instru-

ment, and analysed with a novel library search based OMS tool,

QuickMod (manuscript in preparation).



Barsnes et al. [89] developed a software tool, Mass-

ShiftFinder, tackling the detection problem by screening

MALDI-TOF data for potentially modified peptides that then

can be selected for subsequent TOF-TOF analysis. Their

algorithm performs a blind search for modifications using

peptide mass fingerprints from two proteases with different

cleavage specificities. If the same mass shift relative to the

unmodified theoretical values is observed for both proteases,

and the peptides are overlapping, the mass shift can corre-

spond to a modification or a substitution.
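
The two-protease criterion can be sketched as follows, assuming that candidate peptides from each digest are described by their start and end positions in the protein and by the observed mass shift. This illustrates the criterion only and is not the MassShiftFinder implementation; the tolerance and names are assumptions.

```python
def shared_mass_shifts(hits_a, hits_b, tol=0.05):
    """Find overlapping peptides from two digests that show the same mass shift.

    `hits_a` and `hits_b` are lists of (start, end, mass_shift) tuples with
    inclusive residue positions. Returns tuples of
    ((start_a, end_a), (start_b, end_b), mean_mass_shift).
    """
    matches = []
    for start_a, end_a, shift_a in hits_a:
        for start_b, end_b, shift_b in hits_b:
            overlapping = max(start_a, start_b) <= min(end_a, end_b)
            if overlapping and abs(shift_a - shift_b) <= tol:
                matches.append(
                    ((start_a, end_a), (start_b, end_b), (shift_a + shift_b) / 2)
                )
    return matches
```

The overlap region of a matching pair also narrows down where the modification or substitution can be located.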

Working with more than one protease is in general a

good idea in order to increase the sequence coverage of PTM

sites in the results of analysis [10]. Strong b- or y-ion peaks

on either side of a modified residue are the best evidence for

site specificity. Digesting the samples with multiple

proteases also improves the chances of producing such a

spectrum, facilitating the modification localisation problem.

Furthermore, the confident identification of low-abundance

peptides generally requires multiple replicate LC-MS/MS analyses

of the same or similar samples [65].

As mentioned earlier in this review, an additional

problem that makes the identification of real PTMs parti-

cularly tricky is the fact that some modified peptides frag-

ment poorly in the mass spectrometer. Low-energy CAD/

CID MS/MS has been, by far, the most common method

used to dissociate peptide ions for subsequent sequence

analysis. Ideally, the peptide is cleaved randomly at the

amide bonds along its backbone to produce a homologous

series of b and y-type fragment ions. The presence of

multiple basic residues prevents full fragmentation upon

collision activation/induction and directs the backbone bond

dissociation to specific sites and therefore inhibits the

production of a sufficiently diverse set of sequence ions.

Further, PTMs such as phosphorylation, sulfonation, nitro-

sylation and O- and N-linked glycosylation may similarly

redirect the sites of preferred cleavage. Often the modified

moiety is cleaved off and the peptide backbone is left more

or less intact. The resulting spectra tend to contain little

peptide sequence information and may not allow for

successful identification. In this regard, CAD/CID is most

effective for short, low-charged unmodified peptides. New

instrumentation technologies support alternative solutions

for data generation that have the potential to improve

peptide and protein identification, in particular the identi-

fication of peptides carrying labile modifications.

As n-dimensional MS has become more practical new

techniques to identify modified peptides have been

developed making up for the limitations of CAD/CID

fragmentation. Newer ion-trap instruments provide the

option of collecting MS3 spectra of abundant MS2 peaks.

Peptides carrying labile modifications have been analysed by

automated data-dependent triggering of MS3 acquisition

whenever the dominant neutral loss ion of the appropriate

mass is detected in an MS2 spectrum [8, 90, 91]. By sepa-

rately fragmenting the neutral loss ion a sequence infor-

mation-rich MS3 spectrum can be produced. Different

approaches have been tested to combine MS2 and MS3

spectra from the same peptide to improve peptide identifi-

cation [92–94].
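
The data-dependent trigger can be sketched as a simple test on the MS2 spectrum. The sketch assumes the trigger condition is that the base peak corresponds to the precursor minus the neutral loss; real instrument methods use vendor-specific criteria that are not reproduced here. The default loss mass is the monoisotopic mass of phosphoric acid (about 97.977 Da).

```python
def neutral_loss_trigger(precursor_mz, charge, peaks, loss_mass=97.977, tol=0.5):
    """Return the m/z of the neutral-loss ion if it is the base peak of the
    MS2 spectrum, otherwise None.

    `peaks` is a list of (mz, intensity) pairs. For a precursor of the given
    charge, a neutral loss shifts the m/z by loss_mass / charge.
    """
    if not peaks:
        return None
    expected_mz = precursor_mz - loss_mass / charge
    base_mz, _ = max(peaks, key=lambda peak: peak[1])
    return base_mz if abs(base_mz - expected_mz) <= tol else None
```

A returned m/z would be the ion isolated and fragmented again to obtain the MS3 spectrum.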

Other methods to generate higher quality spectra of

peptides carrying labile modifications rely on new frag-

mentation techniques altogether. Electron capture dissocia-

tion (ECD) is a method for peptide dissociation, which is

relatively indifferent to peptide sequence and length while

avoiding the loss of labile modifications during fragmenta-

tion [95]. However, ECD requires an FT-ICR mass spectro-

meter. Syka et al. [96] introduced electron transfer

dissociation (ETD), which has proven useful for the identification

of modified peptides and peptides with basic residues.

ETD fragments peptides at the Cα-N bond by

transferring an electron from a radical anion to a protonated

peptide inducing similar fragmentation patterns to ECD,

but can be used on more widely accessible ion-trap or

Orbitrap mass spectrometers [97]. Olsen et al. [98] demonstrated

a third new PTM-friendly fragmentation technology

that takes advantage of the Orbitrap’s architecture: higher-energy

C-trap dissociation (HCD). HCD spectra show richer

fragmentation than typical CAD/CID spectra especially in

the low-mass region of the spectrum including a2, b2, y1, y2

ions and immonium ions of histidine and modified residues

such as the immonium ion of phosphotyrosine.

Many more experimental protocols have been described

in the literature aiming to increase identification of PTMs.

For details we refer to an excellent review on this topic [3].

6 OMS studies

Typical PFF tools successfully explore the unmodified frac-

tion of the experimental data but often fail to identify an

important part of the fragmentation spectra. The use of

recently developed OMS software enables a more complete

annotation of MS/MS data sets and refines our under-

standing of the biological system under investigation.

PTMFinder was evaluated on an impressively large data

set of 18 million spectra from a whole-lysate extract of

HEK293 human embryonic kidney cells. Hundreds of

previously uncharacterised modification sites were found,

most of them were phosphorylations, acetylations or

methylations, in addition to more than 900 already docu-

mented modification sites [73]. In the same publication the

authors reported the discovery of several modification sites

conserved between protein orthologues in humans and

protists based on the additional analysis of Dictyostelium discoideum samples.

The ocular lens is another suitable testing ground as it is

a particularly rich source of PTMs. Since the proteins in the

mature lens fibre cells do not turn over during its long lifetime,

it is expected that a wide variety of PTMs will accumulate

on a large number of residues. This makes lens

samples excellent candidates for exploring and evaluating

the ability of OMS tools to detect protein modifications.



Several of the recently developed OMS tools have been

tested on such data sets including InsPecT [56], PTM-Explorer

[71], SIMS [57] and SwissPIT [64]. Wilmarth et al. [99]

identified a total of 155 modification sites in crystallins

analysing a human lens data set produced on two different

instruments, an LCQ Classic ion trap and a Q-TOF hybrid

mass spectrometer, using the InsPecT software suite. Of

these, 77 were previously reported sites and 78 were newly

detected, including carboxymethyl lysine (+58 Da), carboxyethyl

lysine (+72 Da) and an arginine modification of

+55 Da. PTM-Explorer was tested on lens protein samples

from a 100-wk-old mouse [71]. Approximately 30% of all

identified peptides were found to carry modifications other

than the common sample preparation artefacts propiona-

mide on cysteines and methionine oxidation, mainly phos-

phorylation, acetylation and sodium adducts. The developers

of SIMS benchmarked their tool against InsPecT on a small

human lens protein test data set of 243 high-resolution

spectra, of modified peptides, generated on a QTOF

instrument [57]. The two algorithms returned identical

results for 80% of the spectra. Of the spectra, 17% were

identified to the same peptide and modification mass but

disagreeing on the modification site.

Other benchmarking studies also show that different

OMS tools often agree on peptide identification and modi-

fication mass but the site assignment may differ. Protein-

Prospector [72] was compared with InsPecT on a publicly

available data set (regis-web.systemsbiology.net/PublicDa-

tasets/ mix2) of 3734 spectra produced on a QSTAR

instrument, from a protein mixture of 18 standard proteins.

The two tools reported the same peptides with the same

modification mass for 1102 spectra, but approximately half

of the modification sites did not align.
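
The three levels of agreement used in such benchmarks (identical result, same peptide and modification mass but different site, different result) can be computed with a short sketch. The result layout, one entry per spectrum identifier, is an assumption and does not correspond to the output format of any particular tool.

```python
def compare_tools(results_a, results_b, mass_tol=0.5):
    """Summarise the agreement of two OMS tools on commonly identified spectra.

    `results_a` and `results_b` map a spectrum identifier to a
    (peptide, mod_mass, site) tuple. Spectra reported by only one tool
    are ignored.
    """
    summary = {"identical": 0, "site_disagreement": 0, "different": 0}
    for spectrum_id in results_a.keys() & results_b.keys():
        peptide_a, mass_a, site_a = results_a[spectrum_id]
        peptide_b, mass_b, site_b = results_b[spectrum_id]
        if peptide_a == peptide_b and abs(mass_a - mass_b) <= mass_tol:
            key = "identical" if site_a == site_b else "site_disagreement"
        else:
            key = "different"
        summary[key] += 1
    return summary
```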

Modificomb [61] was evaluated on high-accuracy FT data

where two complementary fragmentation techniques, CAD and ECD, were

used, revealing several previously unknown

modifications, later confirmed by MASCOT in targeted

searches, including a frequent 12 Da proline modification

detected in human saliva and a 98 Da modification on

histidine found in an E. coli sample.

7 Concluding remarks

As shown in this review bioinformatics for PTM discovery is

an active area of research. A wide range of OMS software

has been developed in recent years and the results from

various studies described in the previous section demon-

strate their capacity to analyse large data sets from complex

protein samples. OMS tools provide efficient means to

evaluate the quality of a sample by revealing modifications

induced by sample handling and preparation such as

oxidation, pyro-Glu and salt adducts. More importantly,

these tools are capable of identifying known and previously

unknown PTMs. Studies where unrestricted modification

searches were included in the data analysis pipeline show

that there is an important discrepancy between what is

documented in public databases and what remains to be

found about protein modifications.

In order to fully benefit from modification tolerant soft-

ware, the use of these analysis tools should be combined

with appropriate experimental set-ups allowing for the

fragmentation of low-abundance peptides. Employing

peptide fragmentation techniques complementary to CAD/CID,

such as HCD and ETD, is desirable, as higher quality

spectra of peptides carrying labile modifications are

produced. Furthermore it is meaningful to combine

unrestricted searches for modifications with targeted studies

such as multiple reaction monitoring [100] for confirmation

and quantification of interesting modifications. Carefully

designed experimental protocols and unsupervised data

analysis in combination with verification experiments pave

the road for the study of modified protein forms as

biomarkers of disease.

Further improvements of OMS studies can be envisioned

in the years to come. Users will benefit from enhanced

computational resources enabling faster and larger scale

analysis. Parts-per-million mass accuracy instruments

are becoming more and more common in proteomics laboratories

and OMS tools can be further refined by taking full advan-

tage of precise peptide precursor and fragment mass

measurements. A better understanding of the effects of

modifications on peptide fragmentation should lead to more

accurate identification as better theoretical models of

modified spectra can be used for peptide matching. Inte-

grating algorithms for sequence-based prediction of modi-

fication sites such as the AutoMotif Server (AMS 2.0) [101],

as a part of an OMS workflow, may reinforce the positioning

accuracy of modified residues. It has proved fruitful to

combine the results of multiple classical PFF tools in order

to maximise peptide discovery [102]. Therefore, parallel

searches with two or more OMS tools could similarly

improve the identification rates of modified peptides, but

strategies for unifying the search results of multiple tools

need to be evaluated. Effective filtering reducing the number

of candidate peptides per spectrum is an important part of

an OMS workflow. As discussed earlier, a thorough inves-

tigation of appropriate filtering criteria may contribute to a

better trade-off between sensitivity and error rate in OMS

searches.

In order to provide valuable guidance to investigators

interested in modification tolerant data analysis, we encou-

rage software developers to benchmark new OMS tools on

standard data sets. The Peptide Atlas Data Repository

(http://www.peptideatlas.org/repository/) provides many

high quality data sets produced on different instrument

types. A modification-rich human lens data set (PAe000316,

Wilmarth_human_lens) should be a good candidate.

Another useful resource for testing purposes is a large

collection of annotated ion-trap spectra of modified lens

peptides, downloadable at http://bioinfo2.ucsd.edu/

ModdedSpectra.html.



By improving the protein modification discovery in large

proteomics data sets OMS tools should prove valuable in the

quest for mapping the regulation and dynamics of proteomes.

The authors’ related work is part of a collaborative project supported by Microsoft Research.

The authors have declared no conflict of interest.

8 References

[1] Mann, M., Jensen, O. N., Proteomic analysis of post-

translational modifications. Nat. Biotechnol. 2003, 21,

255–261.

[2] Jensen, O. N., Modification-specific proteomics: characterization

of post-translational modifications by mass

spectrometry. Curr. Opin. Chem. Biol. 2004, 8, 33–41.

[3] Witze, E. S., Old, W. M., Resing, K. A., Ahn, N. G. et al.

Mapping protein post-translational modifications with

mass spectrometry. Nat. Methods 2007, 4, 798–806.

[4] Pang, C. N. I., Hayen, A., Wilkins, M. R., Surface acce-

ssibility of protein post-translational modifications.

J. Proteome Res. 2007, 6, 1833–1845.

[5] Aebersold, R., Mann, M., Mass spectrometry-based

proteomics. Nature 2003, 422, 198–207.

[6] Ficarro, S. B., McCleland, M. L., Stukenberg, P. T., Burke,

D. J. et al. Phosphoproteome analysis by mass spectro-

metry and its application to Saccharomyces cerevisiae.

Nat. Biotechnol. 2002, 20, 301–305.

[7] Steen, H., Kuster, B., Fernandez, M., Pandey, A. et al.

Tyrosine phosphorylation mapping of the epidermal

growth factor receptor signaling pathway. J. Biol. Chem.

2002, 277, 1031–1039.

[8] Beausoleil, S. A., Jedrychowski, M., Schwartz, D., Elias,

J. E. et al. Large-scale characterization of HeLa cell nuclear

phosphoproteins. Proc. Natl. Acad. Sci. USA 2004, 101,

12130–12135.

[9] Tissot, B., North, S. J., Ceroni, A., Pang, P. et al. Glycoproteomics:

past, present and future. FEBS Lett. 2009, 583,

1728–1735.

[10] MacCoss, M. J., McDonald, W. H., Saraf, A., Sadygov, R.

et al. Shotgun identification of protein modifications from

protein complexes and lens tissue. Proc. Natl. Acad. Sci.

USA 2002, 99, 7900–7905.

[11] Nielsen, M. L., Savitski, M. M., Zubarev, R. A., Extent of

modifications in human proteome samples and their effect

on dynamic range of analysis in shotgun proteomics. Mol.

Cell. Proteomics 2006, 5, 2384–2391.

[12] Wu, C. H., Apweiler, R., Bairoch, A., Natale, D. A. et al. The

Universal Protein Resource (UniProt): an expanding

universe of protein information. Nucleic Acids Res. 2006,

34, D187–D191.

[13] Creasy, D. M., Cottrell, J. S., Unimod: Protein modifications

for mass spectrometry. Proteomics 2004, 4, 1534–1536.

[14] Garavelli, J. S., The RESID Database of Protein Modifica-

tions as a resource and annotation tool. Proteomics 2004,

4, 1527–1533.

[15] Eng, J. K., McCormack, A. L., Yates, J. R., An approach to

correlate tandem mass spectral data of peptides with

amino acid sequences in a protein database. J. Am. Soc.

Mass Spectrom. 1994, 5, 976–989.

[16] Perkins, D. N., Pappin, D. J., Creasy, D. M., Cottrell, J. S.,

Probability-based protein identification by searching

sequence databases using mass spectrometry data. Elec-

trophoresis 1999, 20, 3551–3567.

[17] Colinge, J., Masselot, A., Giron, M., Dessingy, T. et al.

OLAV: towards high-throughput tandem mass spectro-

metry data identification. Proteomics 2003, 3, 1454–1463.

[18] Craig, R., Beavis, R. C., TANDEM: matching proteins with

tandem mass spectra. Bioinformatics 2004, 20, 1466–1467.

[19] Field, H. I., Fenyo, D., Beavis, R. C., RADARS, a bioinfor-

matics solution that automates proteome mass spectral

analysis, optimises protein identification, and archives

data in a relational database. Proteomics 2002, 2, 36–47.

[20] Geer, L. Y., Markey, S. P., Kowalak, J. A., Wagner, L. et al.

Open mass spectrometry search algorithm. J. Proteome

Res. 2004, 3, 958–964.

[21] Matthiesen, R., Trelle, M. B., Hojrup, P., Bunkenborg, J.

et al. VEMS 3.0: algorithms and computational tools for

tandem mass spectrometry based identification of post-

translational modifications in proteins. J. Proteome Res.

2005, 4, 2338–2347.

[22] Corthals, G. L., Wasinger, V. C., Hochstrasser, D. F.,

Sanchez, J. C. et al. The dynamic range of protein

expression: a challenge for proteomic research. Electro-

phoresis 2000, 21, 1104–1115.

[23] Leitner, A., Foettinger, A., Lindner, W., Improving frag-

mentation of poorly fragmenting peptides and phospho-

peptides during collision-induced dissociation by

malondialdehyde modification of arginine residues.

J. Mass Spectrom. 2007, 42, 950–959.

[24] Ghesquiere, B., Damme, J. V., Martens, L., Vandekerc-

khove, J. et al. Proteome-wide characterization of

N-glycosylation events by diagonal chromatography.

J. Proteome Res. 2006, 5, 2438–2447.

[25] MacCoss, M. J., Wu, C. C., Liu, H., Sadygov, R. et al.

A correlation algorithm for the automated quantitative

analysis of shotgun proteomics data. Anal. Chem. 2003,

75, 6912–6921.

[26] Nesvizhskii, A. I., Keller, A., Kolker, E., Aebersold, R. et al.

A statistical model for identifying proteins by tandem

mass spectrometry. Anal. Chem. 2003, 75, 4646–4658.

[27] Nesvizhskii, A. I., Aebersold, R., Interpretation of shotgun

proteomic data: the protein inference problem. Mol. Cell.

Proteomics 2005, 4, 1419–1440.

[28] Nesvizhskii, A. I., Roos, F. F., Grossmann, J., Vogelzang, M.

et al. Dynamic spectrum quality assessment and iterative

computational analysis of shotgun proteomic data: toward

more efficient identification of post-translational modifi-

cations, sequence polymorphisms, and novel peptides.

Mol. Cell. Proteomics 2006, 5, 652–670.



[29] Sadygov, R. G., Yates, J. R., A hypergeometric probability

model for protein identification and validation using

tandem mass spectral data and protein sequence data-

bases. Anal. Chem. 2003, 75, 3792–3798.

[30] Sadygov, R. G., Liu, H., Yates, J. R., Statistical models for

protein validation using tandem mass spectral data and

protein amino acid sequence databases. Anal. Chem. 2004,

76, 1664–1671.

[31] Sadygov, R., Wohlschlegel, J., Park, S. K., Xu, T. et al.

Central limit theorem as an approximation for intensity-

based scoring function. Anal. Chem. 2006, 78, 89–95.

[32] Keller, A., Nesvizhskii, A. I., Kolker, E., Aebersold, R. et al.

Empirical statistical model to estimate the accuracy of

peptide identifications made by MS/MS and database

search. Anal. Chem. 2002, 74, 5383–5392.

[33] Colinge, J., Masselot, A., Cusin, I., Mahe, E. et al. High-

performance peptide identification by tandem mass spec-

trometry allows reliable automatic data processing in

proteomics. Proteomics 2004, 4, 1977–1984.

[34] Wan, Y., Yang, A., Chen, T., PepHMM: a hidden Markov

model based scoring function for mass spectrometry

database search. Anal. Chem. 2006, 78, 432–437.

[35] Ong, S., Mittler, G., Mann, M., Identifying and quantifying

in vivo methylation sites by heavy methyl SILAC. Nat.

Methods 2004, 1, 119–126.

[36] Shevchenko, A., Loboda, A., Shevchenko, A., Ens, W. et al.

MALDI quadrupole time-of-flight mass spectrometry: a

powerful tool for proteomic research. Anal. Chem. 2000,

72, 2132–2141.

[37] Savitski, M. M., Nielsen, M. L., Zubarev, R. A., New data

base-independent, sequence tag-based scoring of peptide

MS/MS data validates Mowse scores, recovers below

threshold data, singles out modified peptides, and asses-

ses the quality of MS/MS techniques. Mol. Cell. Proteomics

2005, 4, 1180–1188.

[38] MacCoss, M. J., Computational analysis of shotgun

proteomics data. Curr. Opin. Chem. Biol. 2005, 9, 88–94.

[39] Käll, L., Storey, J. D., MacCoss, M. J., Noble, W. S. et al.

Assigning significance to peptides identified by tandem

mass spectrometry using decoy databases. J. Proteome

Res. 2008, 7, 29–34.

[40] Elias, J. E., Gygi, S. P., Target-decoy search strategy for

increased confidence in large-scale protein identifications

by mass spectrometry. Nat. Methods 2007, 4, 207–214.

[41] Chalkley, R. J., Baker, P. R., Hansen, K. C., Medzihradszky,

K. F. et al. Comprehensive analysis of a multidimensional

liquid chromatography mass spectrometry dataset

acquired on a quadrupole selecting, quadrupole collision

cell, time-of-flight mass spectrometer: I. How much of the

data is theoretically interpretable by search engines? Mol.

Cell. Proteomics 2005, 4, 1189–1193.

[42] Mann, M., Wilm, M., Error-tolerant identification of

peptides in sequence databases by peptide sequence tags.

Anal. Chem. 1994, 66, 4390–4399.

[43] Dancik, V., Addona, T. A., Clauser, K. R., Vath, J. E. et al. De

novo peptide sequencing via tandem mass spectrometry.

J. Comput. Biol. 1999, 6, 327–342.

[44] Fernandez-de-Cossio, J., Gonzalez, J., Satomi, Y., Shima,

T. et al. Automated interpretation of low-energy collision-

induced dissociation spectra by SeqMS, a software aid for

de novo sequencing by tandem mass spectrometry. Elec-

trophoresis 2000, 21, 1694–1699.

[45] Ma, B., Zhang, K., Hendrie, C., Liang, C. et al. PEAKS:

powerful software for peptide de novo sequencing by

tandem mass spectrometry. Rapid Commun. Mass Spec-

trom. 2003, 17, 2337–2342.

[46] Johnson, R. S., Taylor, J. A., Searching sequence data-

bases via de novo peptide sequencing by tandem mass

spectrometry. Mol. Biotechnol. 2002, 22, 301–315.

[47] Frank, A., Pevzner, P., PepNovo: de novo peptide sequen-

cing via probabilistic network modeling. Anal. Chem. 2005,

77, 964–973.

[48] Searle, B. C., Dasari, S., Turner, M., Reddy, A. P. et al. High-

throughput identification of proteins and unanticipated

sequence modifications using a mass-based alignment

algorithm for MS/MS de novo sequencing results. Anal.

Chem. 2004, 76, 2220–2230.

[49] Savitski, M. M., Nielsen, M. L., Kjeldsen, F., Zubarev, R. A.

et al. Proteomics-grade de novo sequencing approach.

J. Proteome Res. 2005, 4, 2348–2354.

[50] Sunyaev, S., Liska, A. J., Golod, A., Shevchenko, A. et al.

MultiTag: multiple error-tolerant sequence tag search for

the sequence-similarity identification of proteins by mass

spectrometry. Anal. Chem. 2003, 75, 1307–1315.

[51] Tabb, D. L., Saraf, A., Yates, J. R., GutenTag: high-

throughput sequence tagging via an empirically derived

fragmentation model. Anal. Chem. 2003, 75, 6415–6421.

[52] Tabb, D. L., Ma, Z., Martin, D. B., Ham, A. L. et al. DirecTag:

accurate sequence tags from peptide MS/MS through

statistical scoring. J. Proteome Res. 2008, 7, 3838–3846.

[53] Hernandez, P., Gras, R., Frey, J., Appel, R. D. et al. Popitam:

towards new heuristic strategies to improve protein iden-

tification from tandem mass spectrometry data. Proteo-

mics 2003, 3, 870–878.

[54] Searle, B. C., Dasari, S., Wilmarth, P. A., Turner, M. et al.

Identification of protein modifications using MS/MS de

novo sequencing and the OpenSea alignment algorithm.

J. Proteome Res. 2005, 4, 546–554.

[55] Na, S., Jeong, J., Park, H., Lee, K. et al. Unrestrictive

identification of multiple post-translational modifications

from tandem mass spectrometry using an error-tolerant

algorithm based on an extended sequence tag approach.

Mol. Cell. Proteomics 2008, 7, 2452–2463.

[56] Tanner, S., Shu, H., Frank, A., Wang, L. et al. InsPecT:

identification of posttranslationally modified peptides from

tandem mass spectra. Anal. Chem. 2005, 77, 4626–4639.

[57] Liu, J., Erassov, A., Halina, P., Canete, M. et al. Sequential

interval motif search: unrestricted database surveys of global

MS/MS data sets for detection of putative post-translational

modifications. Anal. Chem. 2008, 80, 7846–7854.

[58] Pevzner, P. A., Mulyukov, Z., Dancik, V., Tang, C. L. et al.

Efficiency of database search for identification of mutated

and modified proteins via mass spectrometry. Genome

Res. 2001, 11, 290–299.



[59] Creasy, D. M., Cottrell, J. S., Error tolerant searching of

uninterpreted tandem mass spectrometry data. Proteo-

mics 2002, 2, 1426–1434.

[60] Craig, R., Beavis, R. C., A method for reducing the time

required to match protein sequences with tandem mass

spectra. Rapid Commun. Mass Spectrom. 2003, 17,

2310–2316.

[61] Savitski, M. M., Nielsen, M. L., Zubarev, R. A., ModifiComb,

a new proteomic tool for mapping substoichiometric post-

translational modifications, finding novel types of modifi-

cations, and fingerprinting complex protein mixtures. Mol.

Cell. Proteomics 2006, 5, 935–948.

[62] Falkner, J. A., Falkner, J. W., Yocum, A. K., Andrews, P. C.

et al. A spectral clustering approach to MS/MS identifica-

tion of post-translational modifications. J. Proteome Res.

2008, 7, 4614–4622.

[63] Ahrne, E., Masselot, A., Binz, P., Müller, M. et al. A simple

workflow to increase MS2 identification rate by subse-

quent spectral library search. Proteomics 2009, 9,

1731–1736.

[64] Quandt, A., Masselot, A., Hernandez, P., Hernandez, C.

et al. SwissPIT: An workflow-based platform for analyzing

tandem-MS spectra using the Grid. Proteomics 2009, 9,

2648–2655.

[65] Hansen, B. T., Davey, S. W., Ham, A. L., Liebler, D. C. et al.

P-Mod: an algorithm and software to map modifications to

peptide sequences using tandem MS data. J. Proteome

Res. 2005, 4, 358–368.

[66] Tsur, D., Tanner, S., Zandi, E., Bafna, V. et al. Identification

of post-translational modifications by blind search of mass

spectra. Nat. Biotechnol. 2005, 23, 1562–1567.

[67] Pevzner, P. A., Dancik, V., Tang, C. L., Mutation-tolerant

protein identification by mass spectrometry. J. Comput.

Biol. 2000, 7, 777–787.

[68] Havilio, M., Wool, A., Large-scale unrestricted identifica-

tion of post-translation modifications using tandem mass

spectrometry. Anal. Chem. 2007, 79, 1362–1368.

[69] Baumgartner, C., Rejtar, T., Kullolli, M., Akella, L. M. et al.

SeMoP: a new computational strategy for the unrestricted

search for modified peptides using LC-MS/MS data.

J. Proteome Res. 2008, 7, 4199–4208.

[70] Tang, W. H., Halpern, B. R., Shilov, I. V., Seymour, S. L.

et al. Discovering known and unanticipated protein modi-

fications using MS/MS database searching. Anal. Chem.

2005, 77, 3931–3946.

[71] Chamrad, D. C., Korting, G., Schäfer, H., Stephan, C.

et al. Gaining knowledge from previously unexplained

spectra-application of the PTM-Explorer software to detect

PTM in HUPO BPP MS/MS data. Proteomics 2006, 6,

5048–5058.

[72] Chalkley, R. J., Baker, P. R., Medzihradszky, K. F., Lynn,

A. J. et al. In-depth analysis of tandem mass spectrometry

data from disparate instrument types. Mol. Cell. Proteo-

mics 2008, 7, 2386–2398.

[73] Tanner, S., Payne, S. H., Dasari, S., Shen, Z. et al. Accurate

annotation of peptide modifications through unrestrictive

database search. J. Proteome Res. 2008, 7, 170–181.

[74] Bern, M., Goldberg, D., Improved ranking functions for

protein and modification-site identifications. J. Comput.

Biol. 2008, 15, 705–719.

[75] Feng, J., Naiman, D. Q., Cooper, B., Probability model for

assessing proteins assembled from peptide sequences

inferred from tandem mass spectrometry data. Anal.

Chem. 2007, 79, 3901–3911.

[76] Haas, W., Faherty, B. K., Gerber, S. A., Elias, J. E. et al.

Optimization and use of peptide mass measurement

accuracy in shotgun proteomics. Mol. Cell. Proteomics

2006, 5, 1326–1337.

[77] Spengler, B., De novo sequencing, peptide composition

analysis, and composition-based sequencing: a new

strategy employing accurate mass determination by Fourier

transform ion cyclotron resonance mass spectrometry.

J. Am. Soc. Mass Spectrom. 2004, 15, 703–714.

[78] Scherl, A., Shaffer, S. A., Taylor, G. K., Hernandez, P. et al.

On the benefits of acquiring peptide fragment ions at high

measured mass accuracy. J. Am. Soc. Mass Spectrom.

2008, 19, 891–901.

[79] Liu, T., Belov, M. E., Jaitly, N., Qian, W. et al. Accurate

mass measurements in proteomics. Chem. Rev. 2007, 107,

3621–3653.

[80] Frank, A. M., Bandeira, N., Shen, Z., Tanner, S. et al.

Clustering millions of tandem mass spectra. J. Proteome

Res. 2008, 7, 113–122.

[81] Bern, M., Goldberg, D., McDonald, W. H., Yates, J. R. et al.

Automatic quality assessment of peptide tandem mass

spectra. Bioinformatics 2004, 20, i49–i54.

[82] Tanner, S., Pevzner, P. A., Bafna, V., Unrestrictive identifi-

cation of post-translational modifications through peptide

mass spectrometry. Nat. Protoc. 2006, 1, 67–72.

[83] Pandey, A., Podtelejnikov, A. V., Blagoev, B., Bustelo, X. R.

et al. Analysis of receptor signaling pathways by mass

spectrometry: identification of vav-2 as a substrate of the

epidermal and platelet-derived growth factor receptors.

Proc. Natl. Acad. Sci. USA 2000, 97, 179–184.

[84] Peng, J., Gygi, S. P., Proteomics: the move to mixtures.

J. Mass Spectrom. 2001, 36, 1083–1091.

[85] Bodenmiller, B., Mueller, L. N., Mueller, M., Domon, B.

et al. Reproducible isolation of distinct, overlapping

segments of the phosphoproteome. Nat. Methods 2007, 4,

231–237.

[86] Seo, J., Jeong, J., Kim, Y. M., Hwang, N. et al. Strategy

for comprehensive identification of post-translational

modifications in cellular proteins, including low

abundant modifications: application to glyceraldehyde-

3-phosphate dehydrogenase. J. Proteome Res. 2008, 7,

587–602.

[87] Schmidt, A., Gehlenborg, N., Bodenmiller, B., Mueller,

L. N. et al. An integrated, directed mass spectrometric

approach for in-depth characterization of complex peptide

mixtures. Mol. Cell. Proteomics 2008, 7, 2138–2150.

[88] Carapito, C., Klemm, C., Aebersold, R., Domon, B. et al.

Systematic LC-MS analysis of labile post-translational

modifications in complex mixtures. J. Proteome Res. 2009,

8, 2608–2614.



[89] Barsnes, H., Mikalsen, S. O., Eidhammer, I., Blind search for

post-translational modifications and amino acid substitu-

tions using peptide mass fingerprints from two proteases.

BMC Res. Notes 2008, 1, 130.

[90] Bodenmiller, B., Mueller, L. N., Pedrioli, P. G. A., Pflieger,

D. et al. An integrated chemical, mass spectrometric and

computational strategy for (quantitative) phosphoproteomics:

application to Drosophila melanogaster Kc167

cells. Mol. Biosyst. 2007, 3, 275–286.

[91] Gruhler, A., Olsen, J. V., Mohammed, S., Mortensen, P.

et al. Quantitative phosphoproteomics applied to the yeast

pheromone signaling pathway. Mol. Cell. Proteomics

2005, 4, 310–327.

[92] Zhang, Z., McElvain, J. S., De novo peptide sequencing by

two-dimensional fragment correlation mass spectrometry.

Anal. Chem. 2000, 72, 2337–2350.

[93] Olsen, J. V., Mann, M., Improved peptide identification in

proteomics by two consecutive stages of mass spectrometric

fragmentation. Proc. Natl. Acad. Sci. USA 2004, 101,

13417–13422.

[94] Ulintz, P. J., Bodenmiller, B., Andrews, P. C., Aebersold, R.

et al. Investigating MS2/MS3 matching statistics: a model

for coupling consecutive stage mass spectrometry data for

increased peptide identification confidence. Mol. Cell.

Proteomics 2008, 7, 71–87.

[95] Kelleher, N. L., Zubarev, R. A., Bush, K., Furie, B. et al.

Localization of labile posttranslational modifications by

electron capture dissociation: the case of gamma-carbox-

yglutamic acid. Anal. Chem. 1999, 71, 4250–4253.

[96] Syka, J. E. P., Coon, J. J., Schroeder, M. J., Shabanowitz, J.

et al. Peptide and protein sequence analysis by electron

transfer dissociation mass spectrometry. Proc. Natl. Acad.

Sci. USA 2004, 101, 9528–9533.

[97] Mikesh, L. M., Ueberheide, B., Chi, A., Coon, J. J. et al. The

utility of ETD mass spectrometry in proteomic analysis.

Biochim. Biophys. Acta 2006, 1764, 1811–1822.

[98] Olsen, J. V., Macek, B., Lange, O., Makarov, A. et al. Higher-

energy C-trap dissociation for peptide modification analy-

sis. Nat. Methods 2007, 4, 709–712.

[99] Wilmarth, P. A., Tanner, S., Dasari, S., Nagalla, S. R. et al.

Age-related changes in human crystallins determined from

comparative analysis of post-translational modifications in

young and aged lens: does deamidation contribute to

crystallin insolubility? J. Proteome Res. 2006, 5,

2554–2566.

[100] Anderson, L., Hunter, C. L., Quantitative mass spec-

trometric multiple reaction monitoring assays for

major plasma proteins. Mol. Cell. Proteomics 2006, 5,

573–588.

[101] Plewczynski, D., Tkacz, A., Wyrwicz, L. S., Rychlewski, L.

et al. AutoMotif Server for prediction of phosphorylation

sites in proteins using support vector machine: 2007

update. J. Mol. Model 2008, 14, 69–76.

[102] Kapp, E. A., Schütz, F., Connolly, L. M., Chakel, J. A. et al.

An evaluation, comparison, and accurate benchmarking of

several publicly available MS/MS search algorithms:

sensitivity and specificity analysis. Proteomics 2005, 5,

3475–3490.


