Sydney Bioinformatics Research Symposium 2012 posterbook
LEARNING & INTEGRATING PETRI NET MODELS OF BIOLOGICAL SYSTEMS
Ashwin Srinivasan 1, Michael Bain 2, Sandeep Kaur 2,3 and Mark Temple 4
1 Indraprastha Institute of Information Technology, New Delhi, India, 2 University of New South Wales, Sydney,
Australia, 3 Garvan Institute, Sydney, Australia and 4 University of Western Sydney, Sydney, Australia
1. Introduction
Qualitative modelling (QM) approaches for biological applications have some limitations: spurious behaviours, lack of concurrency, and no easy extension to continuous or stochastic representations [1]. Petri nets (PNs) are a QM approach that avoids these limitations. PNs have a strong formal basis and have been widely used in modelling biological systems. However, relatively little work exists on learning such models from biological data. We introduced a definite clause representation for PNs called Guarded Transition Systems and showed that a known combinatorial algorithm can be formulated as a search through a lattice of clauses, enabling the use of Inductive Logic Programming (ILP) to learn PNs [2]. Advantages include: more efficient search of the ILP hypothesis space compared to the previous algorithm; an extended representation that identifies regulatory and metabolic models in the same modelling framework; and the use of existing networks as background knowledge to learn hierarchical models.
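To make the Petri net vocabulary concrete (places hold tokens, a transition consumes and produces them, a marking is the token count per place), here is a minimal sketch using water formation, 2 H2 + O2 → 2 H2O, as the net. The class and function names, the arc weights and the initial marking are illustrative assumptions, not taken from the poster or from the Guarded Transition System representation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transition:
    """A Petri net transition with weighted input and output arcs."""
    name: str
    consumes: dict  # place -> tokens removed when the transition fires
    produces: dict  # place -> tokens added when the transition fires


def enabled(marking, transition):
    """A transition is enabled if every input place holds enough tokens."""
    return all(marking.get(place, 0) >= n
               for place, n in transition.consumes.items())


def fire(marking, transition):
    """Return the new marking after firing; the input marking is not modified."""
    if not enabled(marking, transition):
        raise ValueError(f"{transition.name} is not enabled")
    new_marking = dict(marking)
    for place, n in transition.consumes.items():
        new_marking[place] -= n
    for place, n in transition.produces.items():
        new_marking[place] = new_marking.get(place, 0) + n
    return new_marking


# Assumed stoichiometry and assumed initial marking, for illustration only.
make_water = Transition("make_water",
                        consumes={"H2": 2, "O2": 1},
                        produces={"H2O": 2})
initial_marking = {"H2": 2, "O2": 2, "H2O": 0}
final_marking = fire(initial_marking, make_water)
```

After one firing the transition is no longer enabled under this marking, because no H2 tokens remain.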
2. A Petri net example
A Petri net representing construction of water (transition – bar) from reactants (places – circles).
Initial marking.
Final marking.
3. Petri net reconstruction using ILP
A learned PN model for yeast pheromone response. Uses several generic components of signalling pathways encoded as guarded transitions.
4. Model integration: phenotype prediction
PN model [3] of yeast response to H2O2 [4] integrated deletant phenotypes, transcriptomics and proteomics data, highlighting potential pathways.
5. References
[1] Srinivasan and King (2008) "Incremental Identification of Qualitative Models of Biological Systems using Inductive Logic Programming". Journal of Machine Learning Research, 9:1475–1533.
[2] Srinivasan and Bain (2012) "Knowledge-Guided Identification of Petri Net Models of Large Biological Systems", pp. 317-331, LNAI 7207, Springer.
[3] Kaur (2012) "Phenotype Prediction with Models of Cellular Systems". Honours Thesis, School of Computer Science and Engineering, UNSW.
[4] Temple, Perrone and Dawes (2005) "Complex cellular responses to reactive oxygen species". TRENDS in Cell Biology, 15(6):319-326.
Gene regulatory networks in heart development
Bouveret R., Doan T., Ramialison M., de Jong D., Schonrock N., Chapman G., Chen C.M., Bhattacharya S., Dunwoodie S.L. and Harvey R.P.
The Victor Chang Cardiac Research Institute, 405 Liverpool Street, Darlinghurst, NSW 2010, Australia
ABSTRACT: The heart is the first organ to form, and the ongoing viability and growth of the embryo depends vitally on its evolving functional output. To achieve this an organism requires a remarkable degree of spatio-temporal
gene expression control orchestrated largely by a conserved gene regulatory network. Understanding the structure and dynamics of this complex web of biological interactions that specifies the identity and function
of cells and organs is essential for understanding normal development and is a prerequisite to exploring complex human diseases. In order to build the cardiac gene regulatory network at a Systems Biology level we
have adapted the DNA adenine methyltransferase identification (DamID) method to define target genes of key transcription factors that play a major role in this process. We have applied DamID to a differentiated
atrial cardiac cell line where we have been able to: 1) analyze and compare the binding sites and target genes of >15 key cardiac transcription factors; 2) identify new regulatory co-factors that play essential roles in
cardiac development; and 3) study the effect of disease-causing mutations on transcription factor DNA-binding activity and specificity at the genome level. We now aim to apply DamID to different cardiovascular
progenitor cell populations isolated from embryonic stem cell cultures. In conclusion, we have used DamID to elucidate the dynamic interaction between specific transcription factors and gene cis-regulatory elements
to explore the regulatory logic of cardiac development and disease.
DamID method (van Steensel et al., 2001; Vogel et al., 2007)
We adapted the DNA adenine methyltransferase identification (DamID) technique because:
✓ It does not rely on specific antibodies for immuno-precipitation (isoforms, mutants, family members, etc.)
✓ It is very sensitive and can be applied to small amounts of material
• HL-1 cardiomyocyte cells (Claycomb et al., 1998)
• Simple and homogeneous system
[Figure: DamID workflow. Labels: GATC and GAmTC (adenine-methylated) sites; Dam fused to Nkx2-5 (N-term and C-term fusions); Dam-only controls; Nkx2-5 mutants; Cbx1; DpnI digest; LM-PCR amplification; microarrays.]
Identification of Nkx2-5 target genes
NKX2-5 is a homeodomain factor that sits at the very top of the cardiac regulatory hierarchy.
It is to date the most commonly mutated single gene in congenital heart diseases.
We performed DamID using an Affymetrix whole-chromosome microarray
✓ Nkx2-5 peaks are significantly enriched in proximal (5’ and 3’) and intragenic regions
✓ The density of Nkx2-5 peaks is highest immediately upstream of the TSS
✓ The NKE is the most over-represented motif in 5’ proximal regions
✓ Nkx2-5 peaks are conserved only amongst closely related species
We also performed DamID using an Affymetrix promoter microarray
✓ Nkx2-5 target genes are active in cardiomyocytes and cardiac tissue
✓ Nkx2-5 target genes are enriched in cardiac GO terms (heart development, muscle contraction, cell proliferation, etc.) (GREAT: McLean et al., 2010)
Nkx2-5 WT and Nkx2-5 ∆HD
ETS transcription factors have essential functions in heart development
ETS transcription factors at the heart of the cardiac Gene Regulatory Network
[Figure panels; only labels survive in this transcript:
- Peak counts for Nkx2-5 WT and Nkx2-5 ∆HD (1288, 407, 669) and for Elk1, Elk4 and SRF (834, 602, 303, 1459, 92, 187, 78).
- Immunofluorescence in HL-1 cardiomyocytes and CV-1 cells (Hoechst, anti-V5, anti-HA, merge) for Nkx2-5 wild-type, Nkx2-5 ∆HD, exogenous Nkx2-5 and exogenous Elk1, including co-expression.
- Normalised luciferase units for F(1)/F(2) fragment pairs combining Nkx2-5, Nkx2-5 ∆HD, Elk1, Elk1 Y158A, Elk4, SRF and GST (with a positive control and significance marks).
- Genomic distribution (distal, 5’, intragenic, 3’) of Nkx2-5 peaks (47%, 36%, 12%, 5%) and of array probes (38%, 53%, 6%, 3%), over a window from -10 kb through TSS and TES to +5 kb; the assignment of percentages to categories is not recoverable.
- "Nkx2-5 DamID results promoter array" and "Nkx2-5 DamID results WC array": motifs found by Trawler and Weeder, compared with Nkx2-5 TRANSFAC M00240.
- "Nkx2-5 WCA peaks distribution in DamID experiments": density against distance from TSS (bp), -10000 to 10000.
- Zebrafish phenotypes (% embryos: WT, curly tail, unlooped heart, dead) for uninjected, control-MO, elk1-MO and elk4-MO embryos, with mRNA rescue.
- Schematic of Nkx2-5, Gata4 and Elk factors, where "?" = Elk1/4.]
Motif and binding-site sources: Wei et al., 2010; Badis et al., 2009. (Weeder: Pavesi et al., 2004; Trawler: Ettwiller et al., 2007)
Funding: National Health and Medical Research Council (573705) and
Australian Research Council (DP0988507)
The homeodomain (HD) of Nkx2-5 is not
essential for DNA-binding
✓ HD is not essential for homodimerisation
✓ The Nkx2-5 WT/Nkx2-5 ∆HD dimer binds DNA directly
✓ Nkx2-5 ∆HD binds a new set of target genes through protein-protein interactions
Rluc-PCA, FRET, IF microscopy
Nkx2-5 interacts with ETS factors
✓ ETS factors carry Nkx2-5 ∆HD into the nucleus
✓ ETS factors interact with Nkx2-5 WT and Nkx2-5 ∆HD independently of SRF
✓ Elk factors are essential for heart development in zebrafish
We performed DamID with T-box proteins, Gata4, Elk factors and SRF
✓ ETS factors are ubiquitous factors that have general functions
✓ Elk factors are major contributors in the cardiac Gene Regulatory Network
✓ Elk factors co-regulate Nkx2-5 target genes extensively
NGS
Computational analysis of large rearranged immunoglobulin gene sequence sets
Zhiliang Chen1, Marie Kidd2, Katherine Jackson2, Yan Wang2, Mike Bain1, Andrew Collins2, Bruno Gaëta1
1 School of Computer Science and Engineering, UNSW; 2 School of Biotechnology and Biomolecular Sciences, UNSW
• Genetic variation in the immunoglobulin (Ig) locus impacts the immune response
• However, little is known about this variation as major genomics projects (HGP, HapMap, 1000 Genomes) have bypassed the Ig locus
• Ultra-deep sequencing of rearranged Ig genes obtained from a subject’s B-lymphocytes provides a window on this variation
[Figure: blood sample; multiplex PCR; sequencing (454); rearranged IGH gene sequences (VDJ).]
Our focus is on the development of bioinformatics methods for analysing rearranged Ig sequences to understand immunoglobulin diversity at the germline level
Clonal-Relate: a new distance metric and clustering algorithm to identify clonally-related Ig sequences in 454 sequencing datasets
Genotyping: for each sequence set obtained from a subject, find the combination of alleles most likely to generate the observed data, through a combination of sequence alignment (draft genotyping) and application of a maximum likelihood model based on iHMMune-align
4. An Automated Method for Genotyping the Human Immunoglobulin Heavy Chain Variable Region Locus
The maximum likelihood genotype is defined as:
$$\operatorname*{arg\,max}_{G} P(S \mid G) \tag{4.1}$$
where $P(S \mid G)$ is the probability of the sequence set $S$ given a genotype $G$.
In each individual, the gene composition will normally be homozygous (one allele) or heterozygous (two alleles); however, in cases of gene duplication that have not been recognized by the WHO/IUIS/IMGT immunoglobulin gene nomenclature committee, there may be three or even four alleles of the same gene present. For each IGHV gene, the potential genotypes are the subsets ($G = \{g_1, g_2, g_3, \ldots, g_n\}$, $1 \le n \le 4$) of up to four alleles of the allele set identified in the draft genotyping step.
Therefore $P(s_i \mid G)$ can be estimated as:
$$P(s_i \mid G) = \sum_{g_n \in G} P(s_i \mid g_n)\, P(g_n \mid G) \tag{4.2}$$
There are currently no data describing allele-specific rearrangement frequencies, so we assume that the probability of each allele appearing in a genotype is equal:
$$P(s_i \mid G) = \frac{\sum_{g_n \in G} P(s_i \mid g_n)}{n} \tag{4.3}$$
Therefore, according to 4.1, 4.2 and 4.3,
$$\operatorname*{arg\,max}_{G} P(S \mid G) = \operatorname*{arg\,max}_{G} \prod_{s_i \in S} \frac{\sum_{g_n \in G} P(s_i \mid g_n)}{n} \tag{4.4}$$
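The search implied by (4.4) can be sketched as a brute-force enumeration of allele subsets. This is only an illustration under stated assumptions: the per-sequence likelihoods P(s_i | g) are taken as given and strictly positive (in the poster they come from a model based on iHMMune-align), the function names are hypothetical, and log-likelihoods are summed instead of multiplying probabilities to avoid numerical underflow.

```python
import math
from itertools import combinations


def log_likelihood(genotype, seq_likelihoods):
    """log P(S | G) under (4.3): each allele in G is equally likely.

    seq_likelihoods is a list with one dict per sequence s_i, mapping
    allele name -> P(s_i | allele). Values are assumed strictly positive.
    """
    n = len(genotype)
    return sum(math.log(sum(per_allele[g] for g in genotype) / n)
               for per_allele in seq_likelihoods)


def ml_genotype(alleles, seq_likelihoods, max_alleles=4):
    """Return the subset of 1..max_alleles alleles maximising (4.4)."""
    candidates = (subset
                  for n in range(1, min(max_alleles, len(alleles)) + 1)
                  for subset in combinations(alleles, n))
    return max(candidates, key=lambda g: log_likelihood(g, seq_likelihoods))


# Toy data (made up for illustration): three reads fit allele A, three fit B.
toy_likelihoods = (
    [{"A": 0.9, "B": 0.01, "C": 0.01}] * 3
    + [{"A": 0.01, "B": 0.9, "C": 0.01}] * 3
)
best = ml_genotype(["A", "B", "C"], toy_likelihoods)
```

Dividing by n penalises genotypes that include alleles with no supporting reads, which is why the heterozygous pair wins over the three-allele subset in the toy data.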
Figure 4.3: Number of allele differences between pairs of individuals based on high confidence automated IGHV genotyping. Each cell of the matrix contains the number of alleles predicted in the individual of the corresponding row but not in the individual of the corresponding column. Cells comparing twin genotypes are shaded.
Genotyping method evaluation: number of allelic differences observed between pairs of samples. Purple squares correspond to samples from identical twins or different time points from the same individual pre- and post-immune challenge
IgPdb: a database of new Ig polymorphisms identified using iHMMune-align
http://www.cse.unsw.edu.au/ihmmune/IgPdb
Haplotyping: use the associations between alleles observed in rearranged sequences to infer the likely IgH haplotype of the subject. This requires the subject to be heterozygous at the IGHJ4 or IGHJ6 loci.
Section of the phased IGHV haplotype of a subject obtained by applying multinomial logistic regression to classify observed allele associations. Legend (colour-coded squares): one allele present on the chromosome; two alleles present (likely duplication); allele not present (likely deletion); presence/absence of allele cannot be confirmed.
5. Multinomial Logistic Regression for the Identification of Immunoglobulin Haplotypes
Figure 5.6: The IGHV Haplotypes of the Nine Individuals (II). Pink rectangles indicate that an allele of the gene is present. Blue rectangles indicate that the gene has two alleles on the same chromosome. Yellow rectangles indicate that the presence of the allele on the chromosome cannot be confirmed. Rectangles without colour filling indicate a deletion of the gene on the chromosome.
iHMMune-align: high-accuracy HMM-based algorithm for identifying germline genes and mutation events in rearranged Ig genes
[Figure: iHMMune-align HMM topology. 1. Overview. 2. Details.]
[Figure: percent incorrect assignment of alleles (IGHV, IGHD, IGHJ; axis 0 to 9) using a benchmark dataset of known genotype, for iHMMune-align, IMGT/VQUEST + JCTA, IgBLAST, Ab-origin, JOINSOLVER, SoDA and VDJSolver.]
Gaëta et al (2007) Bioinformatics 23:1580
Jackson et al (2010) Bioinformatics 26:3129
Chen et al (2010) Immunome Research 6 (Supp 1): S4
NeCTAR Genomics Virtual Laboratory
http://genome.edu.au
Goals:
• Community infrastructure for genome researchers
• Build on Australia’s Research Cloud
• Manage & analyze massive datasets
• ‘Galaxy’ workflow engine
• ‘Science Collaboration Framework’ from Harvard
Sydney Computational Biologists Meetup
http://meetup.com/Sydney-Computational-Biologists
Goals:
• Meet every few months
• Talks from prominent bioinformaticians
• Hosted at Google office near CBD
• Opportunity to socialise
• Open to all
Special Issue on Visualizing Biological Data
http://nature.com/nmeth/journal/v7/n3s
Reviews of visualization tools for:
• Genome data
• Systems biology data
• Bioimage data
• Alignments & phylogenetic data
• Macromolecular structures
VIZBI Conference Series & Website
http://www.vizbi.org
Next conference:
20-22 March 2013, Broad Institute, Cambridge MA, USA
Website:
• 80 videos
• 254 posters
• > 32,000 unique visitors
Resources for Bioinformaticians & Computational Biologists
Seán I. O'Donoghue 1,2, Christian Stolte 1, & Kenny Sabir
(1) CSIRO Mathematics & Information Sciences, Sydney & (2) Garvan Institute for Medical Research, Sydney
Explore linear and non-linear trends
Network visualization of correlated parameters
Kinase –substrate networks
Pleiotropy
Protein interaction networks
Proteomics
Transcriptomics
Transcriptional regulatory networks
Sequence-based
MIC
A Multi-dimensional Matrix for Systems Biology Research
Chi Nam Ignatius Pang, Apurv Goel, Simone S. Li, and Marc R. Wilkins
How to find novel associations between ‘-omics’ parameters?
Identifying key insulin responsive pathways in plasma membrane trafficking of adipocytes
1 Garvan Institute of Medical Research, Sydney, Australia; 2 School of Mathematics and Statistics, University of Sydney.
Background
Methods
Contributions
Case study
With the recent explosion of high-throughput data many researchers have moved from studying a few molecules of interest to studying whole systems of molecules. The scale of high-throughput proteomics experiments can make both biological and statistical interpretation quite difficult due to the large number of proteins analysed and small number of replicates often performed. Analysing proteins in pathways (or any other group structure) can aid both these issues by increasing statistical power and increasing biological interpretability.
Pengyi Yang1,2, Shi-Xiong Tan1, Daniel Fazakerley1, Ellis Patrick2, James Burchfield1, Chris Gribben1, David James1, Jean Yang2
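[Figure: experimental design. Experiment 1: Basal (light), Insulin (medium) and Wortmannin (heavy) lysates mixed 1:1:1, then cationic colloidal silica plasma membrane enrichment. Experiment 2: Basal (light), Insulin (medium) and MK-2206 (heavy) lysates mixed 1:1:1, then subcellular fractionation plasma membrane enrichment. A further panel shows quadrants I to IV and the rotation angle θ.]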
Multi-dimensional pathway analysis
Two sets of quantitative mass spectrometry experiments were performed to interrogate insulin action on the plasma membrane proteome in 3T3-L1 adipocytes.
In the first experiment, cells were SILAC labelled and either treated with insulin, treated with Wortmannin (a PI3-kinase inhibitor) prior to insulin, or left untreated. Cell lysates were then extracted, mixed at a ratio of 1:1:1 and fractionated for plasma membrane enrichment. The second experiment adopted a similar design but used MK-2206 (an Akt inhibitor) instead of Wortmannin.
This test can be used to test alternate hypotheses:
[Figure: scatter plots (two panels) of Log2(Wort+Ins vs. Basal) against Log2(Ins vs. Basal). Labelled proteins: AP3B1, RAB5C, NAPA, HSPA8, AP3S1, STX4, AP1B1, VAMP7, SNAP23, ARF1, SNX9, TFRC, M6PR, STX12, VAMP2, VAMP8, STX6, STX7, STX8, IGF2R, AP1S1, SNX2, CLTC, AP1G1, STX2, GLUT4, SORT1.]
[Figure: scatter plots (two panels) of Log2(MK+Ins vs. Basal) against Log2(Ins vs. Basal). Labelled proteins: DNM2, M6PR, NAPA, RAB5C, SNAPIN, SORT1, STX12, STX7, STX8, TFRC, TGOLN2, TXNDC5, VAMP2, VAMP7, VAMP8, IGF2R, SNX2, STX16, STX6, GLUT4.]
[Figure: insulin signalling schematic. Components shown: insulin, insulin receptor, IRS, PI3K, PIP3, PDK1, Akt, PKC, AS160, TBC1D1, SNARE and GLUT4 vesicle, with the inhibitors wortmannin and MK and unknown links marked "?".]
Using the multi-dimensional pathway analysis, we identified significantly regulated pathways in the four tested directions. In particular, the test revealed that several plasma membrane trafficking associated pathways have altered expression in insulin and inhibitor treatments.
We subsequently validated several key proteins in these pathways by immunoblotting.
We then extracted the proteins that appear more than three times in the top-10 pathways. The following figures show their location on the scatter plot. It was found that key proteins involved in the pathways are significantly blocked by the PI3-kinase inhibitor Wortmannin, but are not blocked completely by the Akt inhibitor MK-2206.
Step 3. The rotated quantiles are then converted into p-values and combined using Fisher’s method.
Step 4. A univariate pathway analysis is then performed on the combined p-values; we use Fisher’s method again to combine all the p-values of each protein in a pathway.
Insulin is a key hormone that dictates various cellular processes including cell survival, growth, and proliferation. Many actions of insulin are mediated via the activation of phosphatidylinositol 3 kinase (PI3-kinase), which then activates several downstream signalling pathways including the serine/threonine AGC kinase Akt/PKB. In adipocytes, one major effect of insulin-mediated activation of the PI3-kinase-Akt pathway is the translocation of the insulin-responsive glucose transporter GLUT4 to the plasma membrane for glucose uptake. Although it is well known that insulin regulates many of these and other cellular processes, the action of insulin and its dependency on PI3-kinase and/or Akt is not fully understood.
Top pathways, Ins+Wort vs. Basal and Ins vs. Basal:
1. Proteolytic Cleavage of SNARE Complex Proteins
2. Clathrin Derived Vesicle Budding
3. Botulinum Neurotoxicity
4. Golgi Associated Vesicle Biogenesis
5. Membrane Trafficking
6. Lysosome Vesicle Biogenesis
7. NCAM1 Interactions
8. Signaling by PDGF
9. NCAM Signaling for Neurite out Growth
10. Triacylglyceride Biosynthesis
Top pathways, Ins+MK vs. Basal and Ins vs. Basal:
1. Gene expression
2. Cytosolic tRNA Aminoacylation
3. Golgi Associated Vesicle Biogenesis
4. Proteolytic Cleavage of SNARE Complex Proteins
5. Botulinum Neurotoxicity
6. Clathrin Derived Vesicle Budding
7. Insulin Synthesis and Secretion
8. Metabolism of Proteins
9. Membrane Trafficking
10. Pyruvate Metabolism and TCA Cycle
These results confirmed that the plasma membrane associated pathways are mainly inhibited by Wortmannin but only partially by MK-2206.
[Figure: immunoblots of VAMP-2, Syntaxin-16, Syntaxin-6 and Pan Cadherin in TCL and PM fractions under conditions B, I, I+W and I+M, with a bar chart of PM level (0 to 1.2) for VAMP-2, Syntaxin-16 and Syntaxin-6.]
We also made use of a dual colour GLUT4 construct as a readout for GLUT4 trafficking events and found trends consistent with those observed in the mass spectrometry and immunoblotting results.
Multi-dimensional Pathway Analysis
[Figure: GLUT4 at the membrane, GLUT4mem (FOB), against time (min, -10 to 40), two panels. Series: Con, MK, Wort.; conditions: Basal, Insulin, MK, Wort.]
We propose a multi-dimensional pathway analysis methodology that extends all the benefits of traditional pathway analysis to situations where there are more than two biological conditions of interest. This methodology provides a framework that makes it conceptually easy to identify statistically altered pathways in a high-dimensional space.
Our methodology allows a researcher to hypothesise whether:
1) a set of proteins are over-expressed in treatment A and under-expressed in treatment B relative to basal.
2) a treatment instituted a change in the transcription of a set of genes but has had no impact on their translation into their corresponding proteins.
In our case study, we applied this method and identified several key insulin responsive pathways that are mediated by PI3-kinase and/or Akt signalling cascades.
[Figure legend: controls; responding proteins.]
Consider a protein p in pathway P. Let t1 and t2 be test statistics for the protein p from comparisons between basal condition and treatments one and two, respectively. Our method for testing if a pathway P is differentially expressed with respect to an alternate hypothesis Ha is as follows:
Step 1. The test statistics are normalized to follow the quantiles of a normal distribution.
Step 2. These quantiles are projected into polar coordinates which are then rotated such that the quantiles are orientated in the direction of the alternate hypothesis.
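Steps 1 to 4 of the method (normalise, rotate, convert to p-values and combine, then combine per pathway) can be sketched as below. This is a minimal illustration, not the authors' implementation: the rank-based normalisation, the clockwise rotation by a quadrant angle, the use of upper-tail p-values and all function names are assumptions. Fisher's statistic on m p-values has 2m degrees of freedom, so the chi-square upper tail has a closed form and only the standard library is needed.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def normal_quantiles(stats):
    """Step 1 (assumed rank-based): map statistics to normal quantiles."""
    n = len(stats)
    order = sorted(range(n), key=lambda i: stats[i])
    z = [0.0] * n
    for rank, i in enumerate(order, start=1):
        z[i] = STD_NORMAL.inv_cdf((rank - 0.5) / n)
    return z


def rotate(x, y, theta):
    """Step 2: rotate clockwise by theta so the tested quadrant becomes (+, +)."""
    c, s = math.cos(theta), math.sin(theta)
    return x * c + y * s, -x * s + y * c


def chi2_sf_even(x, df):
    """Upper tail of a chi-square distribution with even df (closed form)."""
    half = x / 2.0
    term = total = 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def fisher(pvalues):
    """Fisher's method: X^2 = -2 * sum(ln p), with 2 * len(pvalues) df."""
    x2 = -2.0 * sum(math.log(p) for p in pvalues)
    return chi2_sf_even(x2, 2 * len(pvalues))


def protein_pvalues(t1, t2, theta):
    """Step 3: one combined p-value per protein for the direction theta."""
    combined = []
    for a, b in zip(normal_quantiles(t1), normal_quantiles(t2)):
        x, y = rotate(a, b, theta)
        combined.append(fisher([1.0 - STD_NORMAL.cdf(x),
                                1.0 - STD_NORMAL.cdf(y)]))
    return combined


def pathway_pvalue(t1, t2, members, theta):
    """Step 4: combine the protein p-values of one pathway (list of indices)."""
    p = protein_pvalues(t1, t2, theta)
    return fisher([p[i] for i in members])


# Toy statistics (made up): proteins 0-2 are down in treatment 1, up in treatment 2.
t1 = [-3.0, -2.5, -2.0, 0.1, 0.2, -0.1, 0.3, 1.0, 2.0, 0.5]
t2 = [3.0, 2.5, 2.0, 0.1, -0.2, 0.0, -0.3, -1.0, -2.0, 0.4]
p_down_up = pathway_pvalue(t1, t2, [0, 1, 2], math.pi / 2)  # t1 < 0, t2 > 0
p_up_up = pathway_pvalue(t1, t2, [0, 1, 2], 0.0)            # t1 > 0, t2 > 0
```

In this convention theta = 0 tests the direction t1 > 0, t2 > 0 and theta = π/2 tests t1 < 0, t2 > 0; the toy pathway scores a smaller p-value in the direction it actually moves.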
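[Figure: the Treat 1 vs. Basal / Treat 2 vs. Basal plane divided into quadrants I to IV, with rotation angles θ = 0, π/2, π and 3π/2, and a pathway database supplying the protein sets.]
For each rotation angle θ ∈ {0, π/2, π, 3π/2}, the two rotated p-values are combined with Fisher's method:
$$X^2_{\theta} = -2\left(\ln P_x^{\theta} + \ln P_y^{\theta}\right)$$
The four alternate hypotheses, for proteins $p \in P$:
$H_a: t_1^p > 0,\ t_2^p > 0$
$H_a: t_1^p > 0,\ t_2^p < 0$
$H_a: t_1^p < 0,\ t_2^p > 0$
$H_a: t_1^p < 0,\ t_2^p < 0$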
Functional annotation of “missing” proteins - a bioinformatics approach Shoba Ranganathan1, 4, Javed M. Khan1, David E. James2 & Mark S. Baker3
1Department of Chemistry and Biomolecular Sciences and ARC Centre of Excellence in Bioinformatics, Macquarie University, Sydney, NSW, Australia 2Diabetes and Obesity Program, Garvan Institute of Medical Research, Sydney, NSW, Australia
3Department of Chemistry and Biomolecular Sciences, Macquarie University, Sydney, NSW, Australia 4Department of Biochemistry, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
Background
Results for Chromosome 7
Future directions
Introduction: The Chromosome-centric Human Proteome Project (C-HPP) aims to systematically map all human
proteins. This mapping will lead to a knowledge-based resource defining the full set of proteins
encoded in each chromosome and laying the foundation for the development of a standardized
approach to analyse the massive proteomic data sets currently being generated. The neXtProt
database lists 20,111 proteins as the complete human proteome. However, several of these proteins
have no evidence at the proteomic or structural levels. For example, 203 (28%) proteins of human
chromosome 7 are considered “missing” as they lack experimental evidence. We have developed a
protocol for the functional annotation of these proteins by integrating several bioinformatics analysis
and annotation tools including protein domain mapping, Gene Ontology (GO) mapping, Kyoto
Encyclopedia of Genes and Genomes (KEGG) pathway analysis and interactome analysis.
Functional annotation of a large percentage of these “missing” proteins will be presented and these
annotations can be visualized using the ProteomeBrowser. This prototype generic methodology can
be extended to functionally annotate “missing proteins” from any species.
Aims: This project aims to functionally annotate the “missing proteins” in the human proteome, initially focusing on chromosome 7 (Chr7):
• Identification of the list of Chr7 missing proteins
• Homology mapping of full length and protein domains against near neighbor species
• Categorization of the types of protein families representative in the list of missing proteins
with a focus on membrane proteins
• Assembling a list of key missing proteotypic (e.g. tryptic peptides) and familiotypic peptides (e.g.
domain and class specific peptides); and
• Obtaining high quality functional annotations for these “missing” proteins, thereby contributing to
the development of the ProteomeBrowser visualization tool.
Figure 2: BLASTP results show 170 hits against reviewed non-human
mammalian proteins, with 17 hits ≥ 98%, 37 ≥ 95%, 55 ≥ 90% sequence identity
and 33 proteins with no matches. The top three species matches were Mouse,
Rat and Bovine with 94, 18 and 16 hits, respectively.
Figure 1: A flowchart of the methodology used in our bioinformatics analyses. Phase I represents data
collection and pre-processing. Phase II depicts the various analyses carried out to obtain high quality
annotations for the missing proteins from each chromosome.
Figure 3. A (above): In silico tryptic digestion for the 33 missing proteins has
resulted in 586 peptides for further analysis to identify proteotypic and familiotypic
peptides. B (below): InterProScan has identified protein domain profiles for 167
of the 203 proteins for in-depth analysis and cross referencing with BLASTP
results.
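The in silico tryptic digestion mentioned in Figure 3 can be illustrated with a short sketch. It assumes the commonly used trypsin rule (cleave on the C-terminal side of K or R, but not when the next residue is P) and no missed cleavages; the poster does not state which tool or rule set was used, and the function name and example sequence are hypothetical.

```python
import re

# Zero-width split point: just after K or R, unless the next residue is P.
_TRYPSIN_SITE = re.compile(r"(?<=[KR])(?!P)")


def tryptic_digest(sequence, min_length=1):
    """Return fully cleaved tryptic peptides of at least min_length residues."""
    peptides = _TRYPSIN_SITE.split(sequence.upper())
    return [p for p in peptides if len(p) >= min_length]


# Made-up sequence: the R at position 3 is followed by P, so it is not cleaved.
example_peptides = tryptic_digest("AKRPGKDE")
```

A length filter such as `min_length=3` mimics the usual practice of discarding peptides too short to be informative, though suitable cut-offs depend on the downstream mass spectrometry analysis.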
Figure 5: Blast2GO results for the 203 proteins reveal that the majority of these proteins are involved in binding (90), localized
within the cell (111), and take part in biological processes such as cellular process (92), biological regulation (68) and response to
stimulus (61).
Figure 4 (below): Results for BLASTP run against the PDB. This figure shows
the top 10 hits sorted by the highest identity and alignment length. At least two hits
have a high % identity (≥ 98%) and alignment length. In all, 84 hits have been
retrieved. Further investigation of the actual query coverage will
reveal prospects for homology modeling and other structural studies.
Figure 6: KEGG pathway analysis using KOBAS resulted in 178 matches out of which 83 and 88 proteins were identified with
most enriched pathway and disease terms, respectively. This figure shows the top 7 pathways with the maximum number of
proteins belonging to them, where Signal transduction and Olfactory transduction are the top two with 16 and 13 proteins,
respectively.
Methods
Cellular component Molecular function
Biological process
1. In-depth analyses by cross referencing of all the results.
2. Identification of potentially proteotypic and familiotypic peptides from the in silico tryptic digestion results for the 33
missing proteins for mAb production in collaboration with Monash Antibody Technologies Facility (MATF) and
proteomic identification from MS data.
3. Development of a generic semi-automated annotation pipeline for application to other human chromosomes as well
as proteomes of other organisms.
Acknowledgement
We are grateful to Mr. Gagan Garg, Macquarie University, for valuable bioinformatics assistance and the ANZ Chr7 C-HPP team for their support and suggestions.
Cloudbusting: FAST ANNOTATION DONE CHEAP
Rupert Shuttleworth
Victor Chang Cardiac Research Institute and Faculty of Engineering, University of New South Wales
At the Epigenetics Lab at VCCRI we have been experimenting with cloud computing to see if we can speed up our annotation pipeline. So far we are able to annotate at about 150,000 reads per minute, and it costs us less than a dollar to annotate a million reads this way.
There is an extra time overhead for uploading and downloading data between the remote computers, but by using compression and parallel connections we are able to upload a million reads in about a minute (and download several times faster.)
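The compress-and-split step before upload, and the merge after download, could look roughly like the sketch below. It is an assumption-laden illustration: the poster does not say which tools were used, the function names and file naming scheme are hypothetical, the SAM header is copied into every part so that each part can be processed independently, and the whole file is read into memory, which a production pipeline would replace with streaming.

```python
import gzip
from pathlib import Path


def split_and_compress(sam_path, n_parts, out_dir):
    """Split a SAM file into up to n_parts gzip-compressed parts, in order."""
    sam_path, out_dir = Path(sam_path), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with open(sam_path) as handle:
        lines = handle.readlines()  # sketch only; stream for very large files
    header = [line for line in lines if line.startswith("@")]
    records = [line for line in lines if not line.startswith("@")]
    chunk = max(1, -(-len(records) // n_parts))  # ceiling division
    parts = []
    for start in range(0, len(records), chunk):
        part = out_dir / f"{sam_path.stem}.part{len(parts) + 1}.sam.gz"
        with gzip.open(part, "wt") as out:
            out.writelines(header)
            out.writelines(records[start:start + chunk])
        parts.append(part)
    return parts


def merge_parts(parts, merged_path):
    """Uncompress the parts and merge them, keeping the header only once."""
    with open(merged_path, "w") as out:
        for index, part in enumerate(parts):
            with gzip.open(part, "rt") as handle:
                for line in handle:
                    if index > 0 and line.startswith("@"):
                        continue  # header already written from the first part
                    out.write(line)
```

Each compressed part can then be uploaded over its own connection, which is one way to realise the parallel transfers described above.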
It is early days, and there are more opportunities for cheaper costs and increased speeds that we have not tried yet. But the forecast is looking good so far.

Servers  Workers  Annotation time (min)  EC2/EMR cost ($)
1        2        201                    2.72
1        5        67                     1.84
1        10       39                     1.32
1        20       30                     2.12
1        39       27                     3.64
2        20       26                     2.64
2        38       25                     4.08
[Figure: workflow. Begin. Send data to cloud: the SAM file is compressed, split into compressed parts 1 to N and uploaded to Amazon S3. Annotate data in cloud: on the Hadoop cluster (Amazon EMR/EC2) each part is decompressed, annotated and compressed again. Download data from cloud: the compressed annotated parts 1 to N are downloaded, uncompressed and merged into the annotated SAM file. End.]
OCAP Pipeline
We have developed an open comprehensive
analysis pipeline for iTRAQ (OCAP) to facilitate the
data analysis. The OCAP pipeline integrates a
number of our new algorithms, and includes all the
analysis components:
(1) a peak identification algorithm (DyWave) [1];
(2) an adapted protein identification algorithm
(X!Tandem) [2];
(3) a protein quantification algorithm (WQuant);
(4) a suite of visualisation tools for quality control and
exploratory analysis of the raw data.
DyWave utilises a dynamic wavelet-based peak
identification algorithm which simplifies the peak
identification process and improves the accuracy;
see Figure 2 for details. The incorporated X!Tandem
protein identification algorithm has been shown to
outperform SEQUEST and MASCOT on some
datasets [3]. WQuant achieves protein quantification
by dynamically identifying and extracting iTRAQ
signals from noise in a data-driven fashion.
OCAP Pipeline And A New Hybrid
Protein Identification Method
Penghao Wang1,2, Jean Yang1, Susan R. Wilson2,3,4
1. School of Mathematics and Statistics, University of Sydney, Australia
2. Prince of Wales Clinical School, University of New South Wales, Australia
3. School of Mathematics and Statistics, University of New South Wales, Australia
4. Mathematical Sciences Institute, Australian National University, Australia
Introduction
Tandem mass spectrometry-based iTRAQ protein
quantification enables the determination of relative
expression levels for thousands of proteins
simultaneously. This provides a powerful means for
identifying disease biomarkers.
iTRAQ data analysis is complicated and involves
several analytical stages. Unfortunately, very few
comprehensive analysis pipelines are available.
Existing pipelines consist of tools that are usually
designed and developed separately, which makes
the analysis procedure cumbersome and may
lead to sub-optimal results.
Acknowledgements
The OCAP pipeline was funded by Australian
Research Council (ARC) Project DP094267.
NovoDB was funded by National Health and Medical
Research Council (NHMRC) grant 525453.
NovoDB – A Hybrid Protein
Identification Method
De novo sequencing based protein identification is
the only feasible approach for finding new proteins
and is an effective method for studying protein post-
translational modifications. In order to further
increase the protein identification accuracy and
coverage, we have recently developed a new hybrid
protein identification method – NovoDB [5].
NovoDB differs from existing de novo sequencing
methods which rely on finding one maximum path
from a constructed spectrum graph. NovoDB applies
a novel Bayesian network and dynamic programming
hybrid algorithm to explore the sub-optimal solution
space. Thus NovoDB can better accommodate
various interferences and artefacts present in the
spectra. Evaluated on a large number of spectra,
NovoDB outperforms the most popular de novo
sequencing methods and can improve the accuracy
of de novo sequencing-based protein identification;
see Figure 6. We are currently working on extending
NovoDB to identify protein modifications and
incorporating it into OCAP.
References
[1] Wang P., et al. (2010) Bioinformatics, 26(18):
2242-2249.
[2] Craig R. and Beavis R. (2004) Bioinformatics,
20(9): 1466-1467.
[3] Balgley B.M., et al. (2007) Mol. Cell Proteomics,
6: 1599-1608
[4] Wang P., et al. (2012) Bioinformatics, 28(10):
1404-1405.
[5] Wang P., et al. (2012) BIOCOMP12, pp. 74-81.
[6] Keller A., et al. (2005) Mol. Syst. Bio., Epub2005.
OCAP is offered as both a standalone system and an
R package. The R version of OCAP provides a
convenient interface for downstream statistical
analysis that may significantly facilitate the iTRAQ
data analysis [4].
OCAP is able to generate results in either a fully
automatic or a stepwise manner. Under the
automatic mode, OCAP directly produces peptide
and protein level identification and quantification
results through a single function, which greatly
facilitates the data analysis. The analysis can also be
completed separately for each component, enabling
users to either perform separate analysis on
intermediate results or export the results to other
statistical software. The diagrammatic view of OCAP
and its main functionalities are given in Figure 3.
OCAP provides a series of visualisation tools, and
some examples are given in Figures 4 and 5.
Figure 1. The typical analysis workflow of iTRAQ data analysis, including 3 main components: pre-processing; protein ID; and protein quantification.
Figure 2. Using Continuous Wavelet Transform, the peak identification procedure can be simplified compared with the traditional peak identification procedure.
Figure 3. Overview of the OCAP pipeline and its main analytical functionalities.
Figure 4. Peptide expression image plot for a specific protein.
Figure 5. Protein ID coverage and confidence plot. The identified peptides are marked in colours, otherwise marked in black.
Figure 6. The performance of NovoDB. The x-axis is the identified peptide length in number of amino acids, the y-axis is the identification accuracy.
Conclusion
Our OCAP pipeline can greatly facilitate the data
analysis for iTRAQ. Based on our results, OCAP
performs favourably compared with the TPP pipeline
[6]. However, the adapted X!Tandem protein
identification algorithm can be improved. Therefore
we have developed a hybrid protein identification
method to address the issue. The hybrid method
introduces a two-step identification framework. Using
the framework, we are currently developing new
methods that can detect protein modifications
simultaneously with the protein identification process,
so that the identification coverage can be
significantly improved.
Aim
Genomic analysis
Integration by pathway model
Integrate Genomic and Functional data
Pathway
Our group aims to apply computational and statistical methods to model biological and clinical questions related to cancer biology and translational research. In collaboration with the Pancreatic Cancer and Signal Transduction Groups at Garvan/TKCC, and Prof. Grimmond's group at the Queensland Centre for Medical Genomics, we are undertaking integrative analysis of multidimensional "-omics" datasets, generated by deep sequencing of the cancer genome and profiling of the transcriptome (the set of all messenger RNA molecules), epigenome (the heritable changes regulating gene expression without altering the underlying DNA sequence) and proteome (the entire set of proteins expressed or phosphorylated), with the aim of identifying candidate driver mutations and pathway aberrations in pancreatic cancer.
Cancer Bioinformatics
Integrative analysis of multiple -omics data for Pancreatic Cancer
Mark Cowley, Mark Pinese, Emily Stoddart, Roger Daly, Andrew Biankin, Jianmin Wu
The Kinghorn Cancer Centre & Cancer Research Program, Garvan Institute of Medical Research, Sydney
Proteomic analysis
[Figure: hierarchical clustering dendrogram of pancreatic cancer cell lines (hclust, complete linkage; y-axis: Height), titled "No 1 frequent clustering result (83 times)".]
Acknowledgements: We thank the Australian Pancreatic Genome Initiative (APGI) and all participating clinicians for their support and the high-quality samples used in this study.
Analysis platforms we developed
[Figure: heatmap of relative values (colour scale -2 to 2). Columns are cancer cell lines grouped by category: Ovarian, Rhabdomyosarcoma, Colon, Meningioma, Lung NSCLC, Pancreas, Breast, Endometrial, Renal Cell Carcinoma, GBM, Esophageal, Lung SCLC, Bladder, Melanoma, Liver, Osteosarcoma, Gastric, Multiple Myeloma, Leukemia, and Colon or Melanoma. Rows are shRNAs labelled by TRCN identifier and target gene (for example KRAS, BRAF, CTNNB1, SOX9 and GATA6).]
[Figure: boxplot of intensities (axis ticks 5 to 25) for the cell lines AsPC-1, BxPC3, Capan-1, Capan-2, CFPAC-1, HPAC, HPAF-II, HPDE, HPDE_KRasV12D, Hs700T, Hs766T, MiaPaca-2, Panc-1, PL45, SU8686, SW1990, X-0203, X-0327, X-0403, X-0504, X-0813 and X-1005; dataset: Orbi_pYdataset_normalized_imputed_complete_log2.]
The left figure is a word cloud of the genes with somatic mutations identified by deep sequencing of tumor and normal tissues from 99 pancreatic cancer patients. Text size corresponds to the frequency of mutations in the patient cohort. The right figure shows the summary of copy number changes, differential gene expression, somatic coding mutations and structural aberrations for a pancreatic cancer.
The top left figure shows the boxplot of intensities for 22 pancreatic cancer cell lines. The top right figure shows the most stable phenotypic subtype found using a stepwise hierarchical clustering approach we devised. The left figure lists the top 30 important phosphorylation sites ranked by the variable importance score from a Random Forest model we built using drug response data. The right figure shows the protein-protein interactions among the proteins with these important phosphorylation sites.
By integrating somatic mutations and copy number variations, we found that Axon Guidance pathway genes are enriched in the APGI cohort. In addition, by combining expression profiling and patient outcome data, we found that low expression of the ROBO2 receptor was associated with poor patient survival, while high expression of ROBO3, a known inhibitor of ROBO2 signalling, showed the corresponding reciprocal association with poor survival.
To investigate the potential functional consequences of mutations and copy number variant genes in our cohort, we integrated our dataset with a large-scale in vitro shRNA functional screen (data from Cheung et al., 2011) and two recently published in vivo Sleeping Beauty transposon-mediated insertional mutagenesis screens (Mann et al. and Perez-Mancera et al.). The right figure is a heatmap of the highest confidence cancer-lineage-specific oncogenes.
Our group developed and published several popular analysis platforms. The Protein Interaction Network Analysis (PINA) platform (left figure) is an integrated platform for protein interaction network construction, filtering, analysis, visualization and management. It has been published in Nature Methods and Nucleic Acids Research and has users all over the world. The InterOmics platform (right figure) was developed for integrating multi "-omics" data and has been used in the ICGC Pancreatic Cancer project.
This work is part of the International Cancer Genome Consortium (ICGC) Pancreatic Cancer project, led by the Australian Pancreatic Cancer Genome Initiative (APGI).
Re-Fraction: a machine learning R package for deterministic identification of protein homologues and splice variants in large-scale MS-based proteomics
Background
Methods
Availability
Aims
Results
Bottom-up mass spectrometry (MS)-based proteomics relies on identifying enzymatically digested peptides and then inferring which proteins could be present in the sample from the observed peptide sequences. This process is known as “protein inference”. A key challenge in protein inference is that a high percentage of identified peptides are shared among multiple proteins, which makes the exact identity of the proteins present in the sample ambiguous.
In particular, shared peptides are especially common among protein homologues and splice variants, making deterministic identification of these proteins a nontrivial task.
Design a computational approach to resolve ambiguity in peptide assignment and thereby accurately distinguish proteins from their homologues and/or splice variants expressed in the sample.
In proteomics studies, sample complexity can be reduced by fractionating proteins using SDS-PAGE prior to LC-MS/MS analysis.
[Figure: boxplots of four protein properties across gel fractions F1 to F10: protein mass (kDa), protein length (aa), log2(number of tryptic peptides) and isoelectric point.]
[Figure: SVM regression with respect to gel fractions using protein mass; x-axis: protein mass (kDa), y-axis: regression value. A second panel has axes labelled Mass, pI and Peptide. Colour corresponds to the actual gel fraction in which a protein is found.]
By applying Re-Fraction to a large-scale proteomics dataset generated from the 3T3-L1 plasma membrane proteome study in our lab, we showed that the algorithm can accurately assign each protein to its corresponding fraction.
Firstly, we evaluated the performance of the model using 10-fold cross validation on the training data, in which the peptides are uniquely assigned to a single protein by the MaxQuant software. We calculated accuracy (ACC), sensitivity (SE) and specificity (SP) for each gel fraction.
[Figure: SDS-PAGE schematic with “Big” proteins and “Small” proteins labelled.]
By utilising four protein physical properties as learning features, we can model in which gel fraction each protein should be found. The boxplots show the predictive power of each feature alone in capturing the separation of proteins into gel fractions.
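To make the learning features concrete, the following Python sketch derives three of the four properties (length, mass and number of tryptic peptides) from a protein sequence. It is an illustration only and is not the code of the Re-Fraction R package: the function names are hypothetical, the masses are standard average residue masses, trypsin is assumed to cleave after K or R except before P, and the isoelectric point is omitted for brevity.

```python
# Average residue masses in Da (standard values); one water is added per chain.
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528


def parse_fasta(text):
    """Return {identifier: sequence} from FASTA-formatted text."""
    records, header = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()[0]  # identifier = first word of the header
            records[header] = ""
        elif header is not None:
            records[header] += line.upper()
    return records


def tryptic_peptides(sequence):
    """Cleave after K or R, except when the next residue is P (assumed rule)."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        next_aa = sequence[i + 1] if i + 1 < len(sequence) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def protein_features(sequence):
    """Length, approximate average mass (kDa) and tryptic peptide count."""
    # Residues missing from the table (e.g. X) contribute no mass in this sketch.
    mass = sum(RESIDUE_MASS.get(aa, 0.0) for aa in sequence) + WATER_MASS
    return {
        "length": len(sequence),
        "mass_kda": mass / 1000.0,
        "n_tryptic_peptides": len(tryptic_peptides(sequence)),
    }


# Hypothetical record, used only to show the call pattern.
example = ">protein_A hypothetical\nMAKRPGKLLR\n"
for name, seq in parse_fasta(example).items():
    print(name, protein_features(seq))
```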
A support vector machine (SVM) regression model is trained on the four features described above. The model is then used to assign each protein in a given protein database to its expected gel fraction.
The figure shows the regression results using protein mass alone and using a combination of three features, respectively. Combining more features improves the model performance.
Since the fraction from which a peptide was identified is known, this information can be used to prevent the peptide from being assigned to unlikely or incorrect proteins based on their physical properties, even if all putative proteins in the protein group contain the same observed peptide sequences. We call this procedure “Re-Fraction”.
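The idea can be sketched in a few lines of Python. This is a hedged illustration, not the package's implementation: the function names, the `tolerance` parameter (how far a predicted fraction may lie from the observed one) and the example values are assumptions, since the poster does not state how a match is decided.

```python
def refraction_filter(observed_fraction, candidates, predicted_fraction, tolerance=1.0):
    """Keep candidate proteins whose predicted gel fraction is close to the
    fraction in which the shared peptide was actually observed."""
    return [
        protein
        for protein in candidates
        if abs(predicted_fraction[protein] - observed_fraction) <= tolerance
    ]


def resolve_peptide(observed_fraction, candidates, predicted_fraction, tolerance=1.0):
    """Return the single remaining protein, or None if the peptide stays ambiguous."""
    kept = refraction_filter(observed_fraction, candidates, predicted_fraction, tolerance)
    return kept[0] if len(kept) == 1 else None


# Hypothetical regression outputs for two proteins sharing one peptide.
predicted = {"protein_A": 7.2, "protein_B": 3.1}
assigned = resolve_peptide(7, ["protein_A", "protein_B"], predicted)
```

In this example the peptide was observed in fraction 7, so only `protein_A` is compatible and the shared peptide becomes a unique one. If both predictions fall within the tolerance, the function returns `None` and the peptide stays shared.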
Fraction  ACC    SE     SP
F1        0.981  0.824  0.999
F2        0.988  0.936  0.994
F3        0.986  0.944  0.992
F4        0.967  0.937  0.975
F5        0.955  0.863  0.968
F6        0.957  0.801  0.978
F7        0.946  0.753  0.968
F8        0.952  0.628  0.975
F9        0.964  0.722  0.975
F10       0.981  0.737  0.996
After applying Re-Fraction, at the peptide level, we assign 2424 more unique peptides to their corresponding proteins. This raises the proportion of unique peptides from 35% to 51%, an increase of 16 percentage points over the original result without Re-Fraction.
At the protein level, we deterministically identify 256 more proteins (949 versus 693), distributed roughly equally across fractions in percentage terms.
[Figure: number of deterministic protein identifications per gel fraction (F1 to F10), original versus additional identifications from Re-Fraction; totals: 693 (original) and 949 (Re-Fraction). An adjacent panel carries the size label 40 kDa.]
As a validation, Re-Fraction determined that RagA, but not RagB, is expressed in 3T3-L1 cells. We confirmed this finding by immunoblotting.
We have implemented Re-Fraction as an R package, and it can be applied to any proteomics data with SDS-PAGE fractionation. The project homepage address is:
http://code.google.com/p/re-fraction/
The project homepage contains the Re-Fraction R package, the source code and the test datasets. Please follow the steps below to install and use the Re-Fraction R package:
Pengyi Yang1,2, Sean Humphrey1, Daniel Fazakerley1, Matthew Prior2, Guang Yang1, David James1, Jean Yang2
1 Garvan Institute of Medical Research, Sydney, Australia; 2 School of Mathematics and Statistics, University of Sydney.
\[
\mathrm{SE} = \frac{TP}{TP + FN}, \qquad
\mathrm{SP} = \frac{TN}{TN + FP}, \qquad
\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}
\]
where TP, TN, FP, and FN are true positives, true negatives, false positives, and false negatives, respectively.
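For a multi-fraction problem these quantities are usually computed one fraction at a time, treating that fraction as the positive class and all others as negative. The Python sketch below assumes this one-vs-rest reading, which the poster does not state explicitly; the function name and the toy labels are hypothetical.

```python
def fraction_metrics(true_fractions, predicted_fractions, fraction):
    """Return (ACC, SE, SP) for one gel fraction, one-vs-rest."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(true_fractions, predicted_fractions):
        if pred == fraction and truth == fraction:
            tp += 1
        elif pred == fraction:
            fp += 1
        elif truth == fraction:
            fn += 1
        else:
            tn += 1
    total = tp + tn + fp + fn
    acc = (tp + tn) / total if total else float("nan")
    se = tp / (tp + fn) if (tp + fn) else float("nan")
    sp = tn / (tn + fp) if (tn + fp) else float("nan")
    return acc, se, sp


# Toy labels (hypothetical): six proteins, true versus predicted fraction.
truth = [1, 1, 2, 2, 3, 3]
pred = [1, 2, 2, 2, 3, 1]
acc, se, sp = fraction_metrics(truth, pred, fraction=1)
```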
[Figure: number of shared and unique peptides, original versus Re-Fraction: 5177 unique and 9742 shared (original); 7601 unique and 7318 shared (Re-Fraction).]
[Figure: percentage of unique peptides: 35% (original) versus 51% (Re-Fraction). An adjacent panel carries the size label 30 kDa.]
(1) After downloading the package (current version: ReFraction_0.2.tar.gz), install it from the console as follows:
R CMD INSTALL ReFraction_0.2.tar.gz
(2) Open an R window and load the package as follows:
library(ReFraction)
(3) Type the following to see the general information:
?ReFraction
(4) Type the following to see the application example:
?applyReFraction
(5) Extract protein properties from a FASTA file:
extractDatabase(path to the fasta file)
Please visit the project homepage for more usage details and examples.