Tag-based Blind Identification of PTMs with Point Process Model
1Chunmei Liu, 2Bo Yan, 1Yinglei Song, 2Ying Xu, 1Liming Cai
1Dept. of Computer Science 2Dept. of Biochemistry and Molecular Biology, University of Georgia, USA
2
Tandem mass spectra of peptides
e.g., MS/MS of GLSDGEWQQVLNVWGK (www.ionsource.com)
3
Tandem mass spectra of peptides
(a tutorial from www.cmb.usc.edu)
$\text{Mass}_{b_1} + \text{Mass}_{y_8} = \text{Mass}_{\text{total}}$
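The complementarity of b and y ions can be checked numerically. The sketch below is illustrative only: it assumes singly charged ions, standard monoisotopic residue masses, and hypothetical helper names (`b_ion`, `y_ion`, `neutral_mass`). Under these assumptions the constant sum is the neutral peptide mass plus two protons; the slide's "total" presumably absorbs such offsets. It uses the 16-residue peptide from the earlier slide, so the pairs are b_i and y_(16-i) rather than b1 and y8.

```python
# Monoisotopic residue masses (Da) for the residues in GLSDGEWQQVLNVWGK.
MONO = {
    "G": 57.02146, "L": 113.08406, "S": 87.03203, "D": 115.02694,
    "E": 129.04259, "W": 186.07931, "Q": 128.05858, "V": 99.06841,
    "N": 114.04293, "K": 128.09496,
}
WATER = 18.01056   # added to y ions and to the intact peptide
PROTON = 1.00728   # charge carrier for a singly charged ion


def b_ion(peptide, i):
    """m/z of the singly charged b_i ion (first i residues)."""
    return sum(MONO[aa] for aa in peptide[:i]) + PROTON


def y_ion(peptide, j):
    """m/z of the singly charged y_j ion (last j residues)."""
    return sum(MONO[aa] for aa in peptide[len(peptide) - j:]) + WATER + PROTON


def neutral_mass(peptide):
    """Neutral monoisotopic mass of the intact peptide."""
    return sum(MONO[aa] for aa in peptide) + WATER


peptide = "GLSDGEWQQVLNVWGK"
n = len(peptide)
total = neutral_mass(peptide) + 2 * PROTON
# Every complementary pair b_i / y_(n-i) sums to the same constant.
sums = [b_ion(peptide, i) + y_ion(peptide, n - i) for i in range(1, n)]
print(all(abs(s - total) < 1e-6 for s in sums))
```

This constant sum is what lets a spectrum graph pair each peak with its complement.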
4
Peptide sequencing
De novo sequencing: directly infer the target peptide from its MS data
[Fernandez et al 1992; Dancik et al 1999; Chen et al 2001; Searle et al 2001; Ma et al 2003; Liu et al 2006]
sensitive to MS data quality (noise, missing peaks); difficult
DB search based: compare the target MS with theoretical MS in a peptide database
[Eng et al 1994; Perkins et al 1999]
slow; the target may not be in the database, or may be modified after translation
5
Post-translational modification (PTM)
6
Identifying PTMs in peptide sequencing
Assume a limited set of modification types and model them as pseudo amino acids [Yates et al 1995; Wilkins et al 1999; Tanner et al 2005]
regular sequencing tools can apply
may erroneously process PTMs of unknown types
Blind identification (unlimited modification types)
spectral alignment based (difficult) [Pevzner et al 2000; Tsur et al 2005; Yan et al 2006]
de novo sequencing dependent [Han et al 2005]
7
This work
DB search based (point process model, Yan et al 2006) PTM identification
Yan et al 2006 is comparable to Tsur et al 2005
Both are the best blind PTM identification programs
Yan et al 2006: faster; hits of homologs
Peptide tag-based filtering of database
Graph-theoretic approach to generate tags
8
Our approach details
Input: an experimental spectrum
Output: a peptide sequence and possible PTMs
Steps:
1. Construct an extended spectrum graph, find all maximum weighted anti-symmetric paths, and select tags from the paths
2. Construct a DFA from the tags to filter the peptide database to obtain candidate peptides
3. Apply the point process model to the candidates to identify the peptide and potential PTMs by maximizing the spectral alignment score
9
Spectrum graph
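[Figure: a tandem mass spectrum (intensity vs. m/z) with b- and y-ion peaks, and the corresponding spectrum graph on vertices 1, 2, ..., 2n between a source and a sink.]
Two vertices are joined by a directed edge when their mass difference is the mass of a single amino acid.
De novo sequencing corresponds to finding a longest directed anti-symmetric path from source to sink [Dancik et al 1999, etc.]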
10
Extended spectrum graph [Liu et al 2006]
Assume an MS/MS spectrum S of a peptide P is a set of mass peaks $\{x_1, x_2, \ldots, x_{2k}\}$, where $x_j > x_i$ if $j > i$; the parent mass is M.
[Figure: the peaks $x_1, \ldots, x_i, \ldots, x_{2k-i+1}, \ldots, x_{2k}$ laid out on a mass axis from 0 to M.]
If $x_j - x_i$ is the mass of a single amino acid, connect the corresponding vertices with a directed edge.
Connect each pair of complementary vertices $x_i$ and $x_{2k-i+1}$ with a non-directed edge.
11
Extended spectrum graph
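[Figure: (a) an example spectrum (intensity vs. mass/charge) with peaks at 71, 113, 202, 269, 358 and 400; (b) its extended spectrum graph on vertices 0, 71, 113, 202, 269, 358, 400, 471, with a path labelled A, M, R, L.]
parent mass = 471
Peptide: AMRL/LRMA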
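The example above can be reproduced with a small sketch. It assumes nominal integer residue masses (A = 71, L = 113, M = 131, R = 156), exact mass matching with no tolerance, and unweighted edges; all function names are hypothetical. The exhaustive path search is a stand-in for illustration only, not the tree-decomposition dynamic programming used in this work.

```python
# Nominal residue masses for a tiny alphabet (an assumption for illustration).
RESIDUE_MASS = {"A": 71, "L": 113, "M": 131, "R": 156}


def build_extended_spectrum_graph(peaks, parent_mass):
    """Return (directed adjacency, complementary-partner map)."""
    xs = sorted(peaks)
    vertices = [0] + xs + [parent_mass]
    mass_to_residue = {m: aa for aa, m in RESIDUE_MASS.items()}
    directed = {v: [] for v in vertices}
    # Directed edge lo -> hi when the mass gap is one amino acid.
    for i, lo in enumerate(vertices):
        for hi in vertices[i + 1:]:
            aa = mass_to_residue.get(hi - lo)
            if aa is not None:
                directed[lo].append((hi, aa))
    # Non-directed edges pair x_i with x_(2k-i+1) in sorted order.
    partner = {}
    for i in range(len(xs) // 2):
        a, b = xs[i], xs[len(xs) - 1 - i]
        partner[a], partner[b] = b, a
    return directed, partner


def antisymmetric_paths(directed, partner, source, sink):
    """Enumerate source-to-sink paths that never use both vertices of a pair."""
    found = []

    def extend(vertex, used, letters):
        if vertex == sink:
            found.append("".join(letters))
            return
        for nxt, aa in directed[vertex]:
            if partner.get(nxt) in used:
                continue  # complementary vertex already on the path
            extend(nxt, used | {nxt}, letters + [aa])

    extend(source, {source}, [])
    return found


directed, partner = build_extended_spectrum_graph(
    [71, 113, 202, 269, 358, 400], parent_mass=471)
paths = antisymmetric_paths(directed, partner, 0, 471)
print(paths)
```

With these six peaks the search returns the peptide and its reverse, matching the AMRL/LRMA ambiguity on the slide.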
12
Tag selection for the target peptide
Tag: a short sequence of amino acids
Previous work: PepNovo [Frank et al 2005]
apply de novo sequence algorithms first, and identify tags from the sequenced peptide
Advantage: effective
Disadvantages: the presence of noise, missing peaks, and PTMs makes it hard to improve the effectiveness; slow
13
Tag selection for the target peptide
In this work:
construct an extended spectrum graph (mixed graph) from the target spectrum
tree-decompose the graph
use dynamic programming to find all maximum weighted anti-symmetric paths
advantages: fast and effective; tolerates noise and missing peaks
14
Graph Tree Decomposition
[Figure: an example graph on vertices a to h and a tree decomposition of it; each tree node is a bag of graph vertices.]
15
Properties of tree decomposition
1. Each vertex is contained in at least one bag
16
Properties of tree decomposition
1. Each vertex is contained in at least one bag
2. For any edge {g, f}: there is a bag containing both g and f
17
Properties of tree decomposition
1. Each vertex is contained in at least one bag
2. For any edge {g, f}: there is a bag containing both g and f
3. For every vertex c: the bags that contain c form a connected subtree
18
Tree Width
Tree width of a tree decomposition: $\max_{i \in I} |bag_i| - 1$
Tree width of a graph: minimum width over all tree decompositions of the graph
[Figure: two tree decompositions of the example graph, one with tree width 2 and one with the two bags {a, b, c, d, e} and {a, c, f, g, h}, giving tree width 4.]
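The three properties and the width definition can be checked mechanically. The sketch below uses a small hypothetical graph (not the one in the figure) and hypothetical function names; it assumes the given bag adjacency really is a tree and does not verify that.

```python
def is_tree_decomposition(vertices, edges, bags, tree_edges):
    """Check the three tree-decomposition properties for labelled bags."""
    # Property 1: every vertex appears in at least one bag.
    for v in vertices:
        if not any(v in bag for bag in bags.values()):
            return False
    # Property 2: both endpoints of every edge share some bag.
    for u, v in edges:
        if not any(u in bag and v in bag for bag in bags.values()):
            return False
    # Property 3: the bags containing a vertex form a connected subtree.
    adjacency = {b: set() for b in bags}
    for a, b in tree_edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    for v in vertices:
        holders = {b for b, bag in bags.items() if v in bag}
        start = next(iter(holders))
        seen, stack = {start}, [start]
        while stack:
            current = stack.pop()
            for neighbour in adjacency[current] & holders:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        if seen != holders:
            return False
    return True


def width(bags):
    """Width of a decomposition: largest bag size minus one."""
    return max(len(bag) for bag in bags.values()) - 1


# Hypothetical example: two triangles sharing vertex c.
vertices = "abcde"
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e"), ("c", "e")]
good = {1: {"a", "b", "c"}, 2: {"c", "d", "e"}}
# Bag 2 lacks c, so the bags holding c (1 and 3) are not connected.
bad = {1: {"a", "b", "c"}, 2: {"b"}, 3: {"c", "d", "e"}}

print(is_tree_decomposition(vertices, edges, good, [(1, 2)]), width(good))
print(is_tree_decomposition(vertices, edges, bad, [(1, 2), (2, 3)]))
```

The second decomposition fails only property 3, which is the property that makes bags act as separators.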
19
Tree bags are separators
Internal tree bags in a tree decomposition are separators of the graph
20
Tree bags are separators
Tree bags in a tree decomposition are separators of the graph
This allows efficient dynamic programming
21
Dynamic programming
A table is maintained for each bag
Compute tables bottom up
Each table contains partial optimal solutions; the root table contains the optimal one
Time complexity: $O(6^t n^2)$
22
Dynamic programming (cont'd)
[Figure: dynamic programming tables for a bag $X_i$ (vertices a, b, c; columns a, b, c, V, L) and its child bags $X_j$ and $X_k$ (vertices a, d, b and b, e, c; columns for the bag vertices, $c_1$, $c_2$, V, L); tables are computed bottom-up.]
23
Score scheme and reliability of sequence tags
Assign the score scheme [Dancik et al 1999] as weights to the edges in spectrum graphs
Overall reliability of a tag $t_i$ = $w_1 r_1(t_i) + w_2 r_2(t_i)$
$r_1(t_i)$: reliability computed from $t_i$'s normalized edge weights
$r_2(t_i)$: reliability computed from the autocorrelation score [Liu et al, 2005]
Refer to the paper for details
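As a rough illustration of how the weighted reliability could rank candidate tags, the sketch below takes r1 and r2 as given numbers. The weights, tag names, and scores are all hypothetical; the actual definitions of r1 and r2 are in the paper and are not reproduced here.

```python
def overall_reliability(r1, r2, w1=0.5, w2=0.5):
    """Weighted sum w1*r1 + w2*r2; the default weights are placeholders."""
    return w1 * r1 + w2 * r2


def rank_tags(component_scores, w1=0.5, w2=0.5):
    """Sort tags by overall reliability, most reliable first."""
    return sorted(
        component_scores,
        key=lambda tag: overall_reliability(*component_scores[tag], w1=w1, w2=w2),
        reverse=True,
    )


# Hypothetical (r1, r2) pairs for three candidate tags.
scores = {"AMR": (0.6, 0.8), "MRL": (0.9, 0.7), "RLK": (0.4, 0.5)}
ranking = rank_tags(scores)
print(ranking)
```

Changing the weights shifts the balance between edge-weight evidence and autocorrelation evidence without touching either component.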
24
PTM identification with point process blind search
Find a set of PTMs to maximize the spectral alignment
Can identify all possible PTMs through one round of cross-correlation calculation
Computation time is independent of the number of PTMs
25
PTM identification with point process blind search
Treat a spectrum and the theoretical spectrum of a candidate peptide as a point process:
$x(t) = \sum_{i=1}^{N} \delta(t - t_i)$
where $\{t_i\}$ is a set of mass locations with N peaks, and $\delta$ is the Kronecker delta function.
Assume there are K PTMs; then $\{t_i\}$ can be clustered into K+1 groups:
$x(t) = \sum_{k=0}^{K} x_k(t) = \sum_{k=0}^{K} \sum_{i=1}^{N_k} \delta(t - t_i^{(k)})$
26
PTM identification with point process blind search
When a PTM happens, a shift occurs to $x_k(t)$ to produce $y_k(t)$:
$y_k(t) = x_k(t + \Delta_k) = \sum_{i=1}^{N_k} \delta(t + \Delta_k - t_i^{(k)})$
Use $C[\cdot]$ to denote the total number of non-zero values in a point process:
$C[x_k(t)] = N_k, \quad C[x(t)] = \sum_{k=0}^{K} N_k$
$c_{xy}(\tau) \equiv C[x(t)\, y(t - \tau)]$
27
PTM identification with point process blind search
For K=1, $\Delta$ represents the mass of a possible PTM. We report the top candidate with a $\Delta$ satisfying $|PW_{\exp} - PW - \Delta| \le \text{tolerance}$ and with the maximum $c_{xy}(0) + c_{xy}(\Delta)$.
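The K=1 case can be sketched with integer peak masses. This is an illustrative reading of the slides, not the authors' implementation: the function names are hypothetical, the sign convention (a shift is experimental mass minus theoretical mass) is an assumption that may differ from the slides' convention, and the example peaks are made up (prefix masses of AMRL with a hypothetical +80 shift from the third residue onward).

```python
from collections import Counter


def cross_correlation(theoretical, experimental):
    """Count coinciding peak pairs for every shift tau = b - a."""
    return Counter(b - a for a in theoretical for b in experimental)


def best_single_ptm(theoretical, experimental, mass_gap, tol=1):
    """Pick the shift delta maximising c(0) + c(delta).

    mass_gap is the experimental parent mass minus the candidate's mass;
    only shifts within tol of it are considered.
    """
    c = cross_correlation(theoretical, experimental)
    unshifted = c[0]  # peaks that match without any shift
    best = None
    for delta, hits in c.items():
        if delta == 0 or abs(mass_gap - delta) > tol:
            continue
        score = unshifted + hits
        if best is None or score > best[1]:
            best = (delta, score)
    return best


theoretical = [71, 202, 358, 471]    # prefix masses of AMRL (nominal)
experimental = [71, 202, 438, 551]   # same peaks, last two shifted by +80
result = best_single_ptm(theoretical, experimental, mass_gap=551 - 471)
print(result)
```

All shifts are tallied in one pass over peak pairs, which is why a single cross-correlation round is enough to expose the modification mass.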
28
Evaluations
Datasets
2657 annotated yeast ion trap tandem mass spectra from OPD (Prince et al, 2004) having relatively low mass resolutions
2620 modified spectra, each with one artificially added PTM (Yan et al, 2006)
Experiments
Sequence tag generation
Database search via DFA based model
Blind PTM identification
29
Performance in tag selection

          Tag length  Algorithm  Top 1  Top 3  Top 5  Top 10  Top 25  Time(s)
w/o PTM   3           Ours       75.8   89.1   94.6   96.9    98.1    0.33
                      PepNovo    75.8   90.1   94.6   96.8    98.8    3.62
          4           Ours       65.3   80.5   88.7   93.6    96.4    0.34
                      PepNovo    65.5   81.0   86.6   92.3    95.3    3.69
          5           Ours       56.4   72.8   78.3   85.1    89.8    0.33
                      PepNovo    58.4   71.3   77.6   84.0    88.9    3.83
          6           Ours       50.2   62.3   66.9   76.6    82.4    0.34
                      PepNovo    49.7   61.5   67.8   75.0    81.8    4.27
with PTM  3           Ours       68.1   84.8   90.3   94.8    97.1    0.32
                      PepNovo    62.8   83.7   89.7   84.9    97.8    3.59
          4           Ours       53.5   71.2   78.6   84.8    90.0    0.32
                      PepNovo    51.1   71.7   79.3   85.8    91.4    3.64

Columns: percentages of spectra that have at least one correct tag in the top 1, 3, 5, 10, 25. Comparisons are based on the sequencing results by SEQUEST [Eng et al 1994].
30
Performance in tag selection (cont’d)
Time complexity of the tag selection depends on the tree width t of spectrum graphs: $O(6^t n^2)$
About 90% of such graphs have tree width not exceeding 6
More than 10 times faster than PepNovo [Frank et al 2005]
31
Database search for PTM identification
Construct a DFA from the selected sequence tags and use it to filter a peptide database
Only a small portion of the peptides will remain
The point process model is then applied to the remaining candidates to identify the peptide and potential PTMs
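The filtering step can be sketched with an Aho-Corasick style automaton, one standard way to match many tags in a single pass over each peptide. The slides do not say how the DFA is built, so this construction, the function names, and the toy database are assumptions; the sketch follows failure links at match time instead of expanding them into a full DFA transition table.

```python
from collections import deque


def build_automaton(tags):
    """Build trie transitions, failure links, and accepting flags."""
    goto, fail, accept = [{}], [0], [False]
    for tag in tags:
        state = 0
        for ch in tag:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                accept.append(False)
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        accept[state] = True
    # Breadth-first pass to set failure links below depth 1.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # A state also accepts if its failure target accepts.
            accept[nxt] = accept[nxt] or accept[fail[nxt]]
            queue.append(nxt)
    return goto, fail, accept


def contains_tag(peptide, automaton):
    """True if any tag occurs as a substring of the peptide."""
    goto, fail, accept = automaton
    state = 0
    for ch in peptide:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        if accept[state]:
            return True
    return False


def filter_database(peptides, tags):
    """Keep only peptides containing at least one tag."""
    automaton = build_automaton(tags)
    return [p for p in peptides if contains_tag(p, automaton)]


# Toy database and tags (hypothetical).
database = ["AMRLK", "GLSDGEWQQVLNVWGK", "PEPTIDE", "AMRK"]
candidates = filter_database(database, ["MRL", "SDG"])
print(candidates)
```

Each peptide is scanned once regardless of how many tags are supplied, which suits filtering a large database down to a few candidates.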
32
Performance in PTM identification

Tag length      Top 1  Top 2  Top 3  Top 4  Top 5  Filtration ratio  T(s)
3               76.69  86.01  89.29  90.70  91.62  0.0167            263
4               74.98  80.77  81.71  82.17  84.40  0.0014            34
W/O filtration  60.38  72.33  76.64  79.16  81.17  -                 3843

Columns: cumulative percentages of search results capturing the target peptides exactly in Top i; T is the total time for all 2620 experimental spectra. The comparison is with Yan et al 2006, which does not employ filtration.
33
Summary
A new graph-theoretic approach for peptide tag selection
effective and efficient
Combined with the point process model to sequence peptides and identify PTMs
effective and efficient
More tests are needed (e.g. two PTMs)
Tree decomposition based approaches have not been fully exploited (e.g., improving tag selection effectiveness)
34
Acknowledgement
CS@UGA: Chunmei Liu, Yinglei Song
BMB@UGA: Bo Yan, Ying Xu
NSF, NIH