Sydney Bioinformatics Research Symposium 2012 posterbook

LEARNING & INTEGRATING PETRI NET MODELS OF BIOLOGICAL SYSTEMS

Ashwin Srinivasan1, Michael Bain2, Sandeep Kaur2,3 and Mark Temple4

1 Indraprastha Institute of Information Technology, New Delhi, India; 2 University of New South Wales, Sydney, Australia; 3 Garvan Institute, Sydney, Australia; 4 University of Western Sydney, Sydney, Australia

1. Introduction

Qualitative modelling (QM) approaches for biological applications have some limitations: spurious behaviours, lack of concurrency, and difficulty in extending them to continuous or stochastic representations [1]. Petri nets (PNs) are a QM approach that avoids these limitations. PNs have a strong formal basis and have been widely used in modelling biological systems. However, relatively little work exists on learning such models from biological data. We introduced a definite clause representation for PNs called Guarded Transition Systems and showed that a known combinatorial algorithm can be formulated as a search through a lattice of clauses, enabling the use of Inductive Logic Programming (ILP) to learn PNs [2]. Advantages include: more efficient search of the ILP hypothesis space than with the previous algorithm; an extended representation that identifies regulatory and metabolic models in the same modelling framework; and the use of existing networks as background knowledge to learn hierarchical models.

2. A Petri net example

[1] Srinivasan and King (2008) "Incremental Identification of Qualitative Models of Biological Systems using Inductive Logic Programming". Journal of Machine Learning Research, 9:1475-1533.

[2] Srinivasan and Bain (2012) "Knowledge-Guided Identification of Petri Net Models of Large Biological Systems", pp. 317-331, LNAI 7207, Springer.

[3] Kaur (2012) "Phenotype Prediction with Models of Cellular Systems". Honours Thesis, School of Computer Science and Engineering, UNSW.

[4] Temple, Perrone and Dawes (2005) "Complex cellular responses to reactive oxygen species". TRENDS in Cell Biology, 15(6):219-326.

3. Petri net reconstruction using ILP

4. Model integration: phenotype prediction

A learned PN model for yeast pheromone response. Uses several generic components of signalling pathways encoded as guarded transitions.

5. References

A Petri net representing construction of water (transition – bar) from reactants (places – circles).

Initial marking.

Final marking.
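The firing rule behind the water example can be sketched in a few lines of Python. This is an illustrative sketch only: the token counts in the initial marking are assumed (the figure's markings are not reproduced in the text), and the names `Transition`, `enabled` and `fire` are hypothetical rather than part of the authors' system.

```python
from dataclasses import dataclass


@dataclass
class Transition:
    """A Petri net transition with weighted input and output arcs."""
    name: str
    inputs: dict   # place -> number of tokens consumed
    outputs: dict  # place -> number of tokens produced


def enabled(marking, transition):
    """A transition is enabled if every input place holds enough tokens."""
    return all(marking.get(place, 0) >= n for place, n in transition.inputs.items())


def fire(marking, transition):
    """Return the marking reached by firing an enabled transition."""
    if not enabled(marking, transition):
        raise ValueError(f"{transition.name} is not enabled")
    new_marking = dict(marking)
    for place, n in transition.inputs.items():
        new_marking[place] = new_marking.get(place, 0) - n
    for place, n in transition.outputs.items():
        new_marking[place] = new_marking.get(place, 0) + n
    return new_marking


# 2 H2 + O2 -> 2 H2O; the initial token counts below are assumed for illustration.
make_water = Transition("make_water", {"H2": 2, "O2": 1}, {"H2O": 2})
initial_marking = {"H2": 2, "O2": 2, "H2O": 0}
final_marking = fire(initial_marking, make_water)
```

After one firing the hydrogen place is empty, so the transition is no longer enabled in the final marking.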

PN model [3] of yeast response to H2O2 [4] integrated deletant phenotypes, transcriptomics and proteomics data, highlighting potential pathways.

Gene regulatory networks in heart development

Bouveret R., Doan T., Ramialison M., de Jong D., Schonrock N., Chapman G., Chen C.M., Bhattacharya S., Dunwoodie S.L. and Harvey R.P.

The Victor Chang Cardiac Research Institute, 405 Liverpool Street, Darlinghurst, NSW 2010, Australia

ABSTRACT

The heart is the first organ to form, and the ongoing viability and growth of the embryo depends vitally on its evolving functional output. To achieve this, an organism requires a remarkable degree of spatio-temporal gene expression control, orchestrated largely by a conserved gene regulatory network. Understanding the structure and dynamics of this complex web of biological interactions that specifies the identity and function of cells and organs is essential for understanding normal development and is a prerequisite to exploring complex human diseases. In order to build the cardiac gene regulatory network at a Systems Biology level, we have adapted the DNA adenine methyltransferase identification (DamID) method to define target genes of key transcription factors that play a major role in this process. We have applied DamID to a differentiated atrial cardiac cell line, where we have been able to: 1) analyze and compare the binding sites and target genes of >15 key cardiac transcription factors; 2) identify new regulatory co-factors that play essential roles in cardiac development; and 3) study the effect of disease-causing mutations on transcription factor DNA-binding activity and specificity at the genome level. We now aim to apply DamID to different cardiovascular progenitor cell populations isolated from embryonic stem cell cultures. In conclusion, we have used DamID to elucidate the dynamic interaction between specific transcription factors and gene cis-regulatory elements to explore the regulatory logic of cardiac development and disease.

DamID method (van Steensel et al., 2001; Vogel et al., 2007)

We adapted the DNA adenine methyltransferase identification (DamID) technique because:

✓ It does not rely on specific antibodies for immunoprecipitation (isoforms, mutants, family members, etc.)

✓ It is very sensitive and can be applied to small amounts of material

• HL-1 cardiomyocyte cells (Claycomb et al., 1998)

• Simple and homogeneous system

[Figure: DamID workflow schematic. Labels: Dam fused to the N-terminus or C-terminus of Nkx2-5; Nkx2-5 mutants; controls (Dam, Cbx1); GATC and methylated GAmTC sites; DpnI digest; LM-PCR amplification; microarrays.]

Identification of Nkx2-5 target genes

NKX2-5 is a homeodomain factor that sits at the very top of the cardiac regulatory hierarchy. It is to date the most commonly mutated single gene in congenital heart diseases.

We performed DamID using an Affymetrix whole-chromosome microarray

✓ Nkx2-5 peaks are significantly enriched in proximal (5' and 3') and intragenic regions

✓ The density of Nkx2-5 peaks is highest immediately upstream of the TSS

✓ The NKE is the most over-represented motif in 5' proximal regions

✓ Nkx2-5 peaks are conserved only amongst closely related species

We also performed DamID using an Affymetrix promoter microarray

✓ Nkx2-5 target genes are active in cardiomyocytes and cardiac tissue

✓ Nkx2-5 target genes are enriched in cardiac GO terms (heart development, muscle contraction, cell proliferation, etc.) (GREAT: McLean et al., 2010)

Nkx2-5 WT and Nkx2-5 ∆HD

ETS transcription factors have essential functions in heart development

ETS transcription factors at the heart of the cardiac Gene Regulatory Network

[Figures in this part of the poster did not survive text extraction. Recoverable labels, in extraction order:]

• Peak counts for Nkx2-5 WT and a second Nkx2-5 construct (label garbled): 1288, 407, 669.

• Immunofluorescence in HL-1 cardiomyocytes: Nkx2-5 wild-type; Hoechst, anti-V5.

• Luciferase assay (normalised luciferase units, axis 0.00 to 0.05): pairings of F(1) and F(2) fusions of Nkx2-5, a second Nkx2-5 construct (label garbled) and GST, including a positive control.

• Genomic distribution over distal, 5', intragenic and 3' regions: 47%, 36%, 12%, 5% (Nkx2-5 peaks) and 38%, 53%, 6%, 3% (probes); the assignment of percentages to regions is not recoverable. Gene track markers: -10 kb, TSS, TES, +5 kb.

• Nkx2-5 DamID results, promoter array and WC array: motifs found with Trawler and Weeder; Nkx2-5 motif TRANSFAC M00240.

• Density plot "Nkx2-5 WCA peaks distribution in DamID experiments": x axis, distance from TSS (bp), -10000 to 10000; y axis, density.

• Zebrafish experiments: % embryos by phenotype (WT, curly tail, unlooped heart, dead) for uninjected, control-MO, elk1-MO and elk4-MO embryos; mRNA labels as extracted: elk1, Y167A Elk1, elk4.

• Interaction assays (axis 0.000 to 0.040): Nkx2-5 or Nkx2-5 dHD paired with Nkx2-5, Elk1, Elk1 Y158A, Elk4 or GST; SRF paired with Elk1, Elk1 Y158A or GST.

• Immunofluorescence in CV-1 cells: co-expression of exogenous Nkx2-5 and exogenous Elk1; Hoechst, anti-V5, anti-HA, merge.

• Peak overlap for Elk1, Elk4 and SRF: 834, 602, 303, 1459, 92, 187, 78.

• Model schematic with an unknown partner marked "? = Elk1/4".

Motif sources: Wei et al., 2010; Badis et al., 2009. (Weeder: Pavesi et al., 2004; Trawler: Ettwiller et al., 2007)

Funding: National Health and Medical Research Council (573705) and

Australian Research Council (DP0988507)

The homeodomain (HD) of Nkx2-5 is not essential for DNA-binding

✓ HD is not essential for homodimerisation

✓ The Nkx2-5 WT/Nkx2-5 ∆HD dimer binds DNA directly

✓ Nkx2-5 ∆HD binds a new set of target genes through protein-protein interactions

[Panel labels: Rluc-PCA, FRET, IF microscopy]

Nkx2-5 interacts with ETS factors

✓ ETS factors carry Nkx2-5 ∆HD into the nucleus

✓ ETS factors interact with Nkx2-5 WT and Nkx2-5 ∆HD independently of SRF

✓ Elk factors are essential for heart development in zebrafish

We performed DamID with T-box proteins, Gata4, Elk factors and SRF

✓ ETS factors are ubiquitous factors that have general functions

✓ Elk factors are major contributors in the cardiac Gene Regulatory Network

✓ Elk factors co-regulate Nkx2-5 target genes extensively

NGS

Computational analysis of large rearranged immunoglobulin gene sequence sets

• Genetic variation in the immunoglobulin (Ig) locus impacts the immune response

• However, little is known about this variation, as major genomics projects (HGP, HapMap, 1000 Genomes) have bypassed the Ig locus

• Ultra-deep sequencing of rearranged Ig genes obtained from a subject's B-lymphocytes provides a window on this variation

Clonal-Relate: a new distance metric and clustering algorithm to identify clonally-related Ig sequences in 454 sequencing datasets

Zhiliang Chen1, Marie Kidd2, Katherine Jackson2, Yan Wang2, Mike Bain1, Andrew Collins2, Bruno Gaëta1

1 School of Computer Science and Engineering, UNSW; 2 School of Biotechnology and Biomolecular Sciences, UNSW

[Figure: Blood sample → Multiplex PCR → Sequencing (454) → Rearranged IGH gene sequences (VDJ)]

Our focus is on the development of bioinformatics methods for analysing rearranged Ig sequences to understand immunoglobulin diversity at the germline level.

Genotyping: For each sequence set obtained from a subject, find the combination of alleles most likely to generate the observed data, through a combination of sequence alignment (draft genotyping) and application of a maximum likelihood model based on iHMMune-align.

4. An Automated Method for Genotyping the Human Immunoglobulin Heavy Chain Variable Region Locus

The maximum likelihood genotype is defined as:

$$\arg\max_{G} P(S \mid G) \tag{4.1}$$

where $P(S \mid G)$ is the probability of the sequence set $S$ given a genotype $G$.

In each individual, the gene composition will normally be homozygous (one allele) or heterozygous (two alleles); however, in cases of gene duplication that have not been recognized by the WHO/IUIS/IMGT immunoglobulin gene nomenclature committee, there may be three or even four alleles of the same gene present. For each IGHV gene, the potential genotypes are the subsets ($G = \{g_1, g_2, g_3, \ldots, g_n\}$, $1 \le n \le 4$) of up to four alleles of the allele set identified in the draft genotyping step.

Therefore $P(s_i \mid G)$ can be estimated as:

$$P(s_i \mid G) = \sum_{g_n \in G} P(s_i \mid g_n)\, P(g_n \mid G) \tag{4.2}$$

There are currently no data describing allele-specific rearrangement frequencies, so we assume that the possibility of each allele appearing in a genotype is equal:

$$P(s_i \mid G) = \frac{\sum_{g_n \in G} P(s_i \mid g_n)}{n} \tag{4.3}$$

Therefore, according to 4.1, 4.2 and 4.3,

$$\arg\max_{G} P(S \mid G) = \arg\max_{G} \prod_{s_i \in S} \frac{\sum_{g_n \in G} P(s_i \mid g_n)}{n} \tag{4.4}$$
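Equation 4.4 can be evaluated by brute force over all subsets of up to four candidate alleles. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: the per-sequence likelihoods P(s_i | g_n) are assumed to be supplied by an external model (iHMMune-align in the poster), the numbers in the example are invented, the allele names are used only as labels, and `log_likelihood` and `best_genotype` are hypothetical names. Logarithms replace the product to avoid numerical underflow.

```python
import math
from itertools import combinations


def log_likelihood(genotype, seq_likelihoods):
    """log P(S|G) as in eq. 4.4: sum over sequences of log(mean over alleles of P(s_i|g_n))."""
    n = len(genotype)
    total = 0.0
    for per_allele in seq_likelihoods:
        mean = sum(per_allele.get(allele, 0.0) for allele in genotype) / n
        if mean <= 0.0:
            return float("-inf")  # this genotype cannot explain the sequence
        total += math.log(mean)
    return total


def best_genotype(candidate_alleles, seq_likelihoods, max_alleles=4):
    """Exhaustively search subsets of 1..max_alleles alleles for the maximum likelihood genotype."""
    best, best_ll = None, float("-inf")
    for n in range(1, max_alleles + 1):
        for genotype in combinations(sorted(candidate_alleles), n):
            ll = log_likelihood(genotype, seq_likelihoods)
            if ll > best_ll:
                best, best_ll = genotype, ll
    return best, best_ll


# Invented likelihoods for four sequences and three candidate alleles (illustration only).
A, B, C = "IGHV1-69*01", "IGHV1-69*02", "IGHV1-69*06"
seq_likelihoods = [
    {A: 0.90, B: 0.01, C: 0.01},
    {A: 0.01, B: 0.01, C: 0.90},
    {A: 0.90, B: 0.01, C: 0.01},
    {A: 0.01, B: 0.01, C: 0.90},
]
genotype, ll = best_genotype([A, B, C], seq_likelihoods)
```

Because each sequence likelihood is averaged over the n alleles in the genotype, adding an allele that explains no sequences lowers the likelihood, which is why the two-allele genotype wins in this example.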

Figure 4.3: Number of allele differences between pairs of individuals based on high confidence automated IGHV genotyping. Each cell of the matrix contains the number of alleles predicted in the individual of the corresponding row but not in the individual of the corresponding column. Cells comparing twin genotypes are shaded.

Genotyping method evaluation: number of allelic differences observed between pairs of samples. Purple squares correspond to samples from identical twins or different time points from the same individual pre- and post-immune challenge.

IgPdb: a database of new Ig polymorphisms identified using iHMMune-align

http://www.cse.unsw.edu.au/ihmmune/IgPdb

Haplotyping: Use the associations between alleles observed in rearranged sequences to infer the likely IgH haplotype of the subject. This requires the subject to be heterozygous at the IGHJ4 or IGHJ6 loci.

Section of the phased IGHV haplotype of a subject obtained by applying multinomial logistic regression to classify observed allele associations. Legend: one allele present on the chromosome; two alleles present (likely duplication); allele not present (likely deletion); presence/absence of allele cannot be confirmed.

5. Multinomial Logistic Regression for the Identification of Immunoglobulin Haplotypes

Figure 5.6: The IGHV Haplotypes of the Nine Individuals (II). Pink rectangles indicate that an allele of the gene is present. Blue rectangles indicate that the gene has two alleles present on the same chromosome. Yellow rectangles indicate that the presence of the allele on the chromosome cannot be confirmed. Rectangles without colour filling indicate a deletion of the gene on the chromosome.

iHMMune-align: high-accuracy HMM-based algorithm for identifying germline genes and mutation events in rearranged Ig genes

[Figure: iHMMune-align HMM topology. 1. Overview. 2. Details.]

[Figure: bar chart of percent incorrect assignment of alleles (IGHV, IGHD and IGHJ; axis 0 to 9%) using a benchmark dataset of known genotype, comparing iHMMune-align, IMGT/VQUEST + JCTA, IgBLAST, Ab-origin, JOINSOLVER, SoDA and VDJSolver.]

Gaëta et al. (2007) Bioinformatics 23:1580

Jackson et al. (2010) Bioinformatics 26:3129

Chen et al. (2010) Immunome Research 6 (Supp 1): S4

NeCTAR Genomics Virtual Laboratory

http://genome.edu.au

Goals:

• Community infrastructure for genome researchers

• Build on Australia’s Research Cloud

• Manage & analyze massive datasets

• ‘Galaxy’ workflow engine

• ‘Science Collaboration Framework’ from Harvard

Sydney Computational Biologists Meetup

http://meetup.com/Sydney-Computational-Biologists

Goals:

• Meet every few months

• Talks from prominent bioinformaticians

• Hosted at Google office near CBD

• Opportunity to socialise

• Open to all

Special Issue on Visualizing Biological Data

http://nature.com/nmeth/journal/v7/n3s

Reviews of visualization tools for:

• Genome data

• Systems biology data

• Bioimage data

• Alignments & phylogenetic data

• Macromolecular structures

VIZBI Conference Series & Website

http://www.vizbi.org

Next conference:

20-22 March 2013, Broad Institute, Cambridge MA, USA

Website:

• 80 videos

• 254 posters

• > 32,000 unique visitors

Resources for Bioinformaticians & Computational Biologists

Seán I. O'Donoghue 1,2, Christian Stolte 1, & Kenny Sabir 2

(1) CSIRO Mathematics & Information Sciences, Sydney & (2) Garvan Institute for Medical Research, Sydney

Explore linear and non-linear trends

Network visualization of correlated parameters

Kinase –substrate networks

Pleiotropy

Protein interaction networks

Proteomics

Transcriptomics

Transcriptional regulatory networks

Sequence-based

MIC

A Multi-dimensional Matrix for Systems Biology Research

1 2

3

How to find novel associations between ‘-omics’ parameters?


Chi Nam Ignatius Pang, Apurv Goel, Simone S. Li, and Marc R. Wilkins


Identifying key insulin responsive pathways in plasma membrane trafficking of adipocytes

1 Garvan Institute of Medical Research, Sydney, Australia; 2 School of Mathematics and Statistics, University of Sydney.

Background

Methods

Contributions

Case study

With the recent explosion of high-throughput data many researchers have moved from studying a few molecules of interest to studying whole systems of molecules. The scale of high-throughput proteomics experiments can make both biological and statistical interpretation quite difficult due to the large number of proteins analysed and small number of replicates often performed. Analysing proteins in pathways (or any other group structure) can aid both these issues by increasing statistical power and increasing biological interpretability.

Pengyi Yang1,2, Shi-Xiong Tan1, Daniel Fazakerley1, Ellis Patrick2, James Burchfield1, Chris Gribben1, David James1, Jean Yang2

[Figure: experimental design. Experiment 1: Basal (light), Insulin (medium), Wortmannin (heavy); lysates mixed 1:1:1; plasma membrane enrichment by cationic colloidal silica. Experiment 2: Basal (light), Insulin (medium), MK-2206 (heavy); lysates mixed 1:1:1; plasma membrane enrichment by subcellular fractionation.]

Multi-dimensional pathway analysis

Two sets of quantitative mass spectrometry experiments were performed to interrogate insulin action on the plasma membrane proteome in 3T3-L1 adipocytes.

In the first experiment, cells were SILAC-labelled and either treated with insulin, treated with Wortmannin (a PI3-kinase inhibitor) prior to insulin, or left untreated. Cell lysates were then extracted, mixed at a ratio of 1:1:1 and fractionated for plasma membrane enrichment. The second experiment adopted a similar design but used MK-2206 (an Akt inhibitor) instead of Wortmannin.

This test can be used to test alternate hypotheses:

[Figures: scatter plots of Log2(Ins vs. Basal) against Log2(Wort+Ins vs. Basal) and against Log2(MK+Ins vs. Basal), with selected proteins labelled. Wortmannin experiment: AP3B1, RAB5C, NAPA, HSPA8, AP3S1, STX4, AP1B1, VAMP7, SNAP23, ARF1, SNX9, TFRC, M6PR, STX12, VAMP2, VAMP8, STX6, STX7, STX8, IGF2R, AP1S1, SNX2, CLTC, AP1G1, STX2, GLUT4, SORT1. MK-2206 experiment: DNM2, M6PR, NAPA, RAB5C, SNAPIN, SORT1, STX12, STX7, STX8, TFRC, TGOLN2, TXNDC5, VAMP2, VAMP7, VAMP8, IGF2R, SNX2, STX16, STX6, GLUT4.]

[Figure: insulin signalling schematic. Labels: Insulin, Insulin receptor, IRS, PI3K, PIP3, PDK1, Akt, PKC, AS160, TBC1D1, SNARE, GLUT4 vesicle; inhibitors wortmannin and MK.]

Using the multi-dimensional pathway analysis, we identified significantly regulated pathways in each of the four tested directions. In particular, the test revealed that several pathways associated with plasma membrane trafficking have altered expression in the insulin and inhibitor treatments.

We subsequently validated several key proteins in these pathways by immunoblotting.

We then extracted the proteins that appear more than three times in the top-10 pathways. The following figures show their location on the scatter plot. Key proteins involved in these pathways are significantly blocked by the PI3-kinase inhibitor Wortmannin, but are not blocked completely by the Akt inhibitor MK-2206.

Step 3. The rotated quantiles are then converted into p-values and combined using Fisher's method.

Step 4. A univariate pathway analysis is then performed on the combined p-values; we use Fisher's method again to combine the p-values of all proteins in a pathway.

Insulin is a key hormone that dictates various cellular processes including cell survival, growth, and proliferation. Many actions of insulin are mediated via the activation of phosphatidylinositol 3-kinase (PI3-kinase), which then activates several downstream signalling pathways including the serine/threonine AGC kinase Akt/PKB. In adipocytes, one major effect of insulin-mediated activation of the PI3-kinase-Akt pathway is the translocation of the insulin-responsive glucose transporter GLUT4 to the plasma membrane for glucose uptake. Although it is well known that insulin regulates many of these and other cellular processes, the action of insulin and its dependency on PI3-kinase and/or Akt is not fully understood.

Top-10 pathways (Ins+Wort vs. Basal, Ins vs. Basal):

1. Proteolytic Cleavage of SNARE Complex Proteins
2. Clathrin Derived Vesicle Budding
3. Botulinum Neurotoxicity
4. Golgi Associated Vesicle Biogenesis
5. Membrane Trafficking
6. Lysosome Vesicle Biogenesis
7. NCAM1 Interactions
8. Signaling by PDGF
9. NCAM Signaling for Neurite out Growth
10. Triacylglyceride Biosynthesis

Top-10 pathways (Ins+MK vs. Basal, Ins vs. Basal):

1. Gene expression
2. Cytosolic tRNA Aminoacylation
3. Golgi Associated Vesicle Biogenesis
4. Proteolytic Cleavage of SNARE Complex Proteins
5. Botulinum Neurotoxicity
6. Clathrin Derived Vesicle Budding
7. Insulin Synthesis and Secretion
8. Metabolism of Proteins
9. Membrane Trafficking
10. Pyruvate Metabolism and TCA Cycle

These results confirmed that the plasma membrane associated pathways are mainly inhibited by Wortmannin but only partially by MK-2206.

[Figure: immunoblots of VAMP-2, Syntaxin-16, Syntaxin-6 and Pan Cadherin in total cell lysate (TCL) and plasma membrane (PM) fractions under B, I, I+W and I+M conditions, with quantified PM level (axis 0 to 1.2).]

We also used a dual colour GLUT4 construct as a readout for GLUT4 trafficking events and found a trend consistent with the mass spectrometry and immunoblotting results.

[Figure: GLUT4 time course. x axis: Time (min), -10 to 40; y axis: GLUT4mem (FOB), 0 to 6; conditions: Con, MK, Wort., Basal, Insulin.]

We propose a multi-dimensional pathway analysis methodology that extends all the benefits of traditional pathway analysis to situations where there are more than two biological conditions of interest. This methodology provides a framework that makes it conceptually easy to identify statistically altered pathways in a high-dimensional space.

Our methodology allows a researcher to test hypotheses such as whether:

1) a set of proteins is over-expressed in treatment A and under-expressed in treatment B relative to basal;

2) a treatment induced a change in the transcription of a set of genes but had no impact on their translation into the corresponding proteins.

In our case study, we applied this method and identified several key insulin-responsive pathways that are mediated by PI3-kinase and/or Akt signalling cascades.

Consider a protein p in pathway P. Let t1 and t2 be test statistics for the protein p from comparisons between basal condition and treatments one and two, respectively. Our method for testing if a pathway P is differentially expressed with respect to an alternate hypothesis Ha is as follows:

Step 1. The test statistics are normalized to follow the quantiles of a normal distribution.

Step 2. These quantiles are projected into polar coordinates, which are then rotated such that the quantiles are oriented in the direction of the alternate hypothesis.

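[Figure: quadrants I to IV of the plane with axes Treat 1 vs. Basal and Treat 2 vs. Basal, rotated by an angle θ; pathways are taken from a pathway database.]

For each rotation angle θ, the two p-values of a protein are combined with Fisher's method:

$$X^2_{\theta} = -2\left(\ln P_x^{\theta} + \ln P_y^{\theta}\right), \qquad \theta \in \{0,\ \pi/2,\ \pi,\ 3\pi/2\}$$

The four alternate hypotheses, for proteins $p \in P$, are:

$H_a: t_1^p > 0,\ t_2^p > 0$

$H_a: t_1^p > 0,\ t_2^p < 0$

$H_a: t_1^p < 0,\ t_2^p > 0$

$H_a: t_1^p < 0,\ t_2^p < 0$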
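Steps 2 to 4 of the method can be sketched in Python using only the standard library. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes Step 1 has already produced z-scores, that p-values are one-sided upper-tail normal probabilities, and that the rotation angle maps the quadrant of interest onto the quadrant where both coordinates are positive (0 for both statistics positive, π/2 for t1 positive and t2 negative, π for both negative, 3π/2 for t1 negative and t2 positive). The poster does not state which angle belongs to which hypothesis, and all function names are hypothetical.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()
_TINY = 1e-300  # floor that keeps log() finite for extreme z-scores


def rotate(z1, z2, theta):
    """Step 2: convert to polar coordinates, rotate by theta, convert back."""
    radius = math.hypot(z1, z2)
    angle = math.atan2(z2, z1) + theta
    return radius * math.cos(angle), radius * math.sin(angle)


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable with an even number of degrees of freedom."""
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def fisher_combine(pvalues):
    """Fisher's method: X2 = -2 * sum(ln p), chi-square with 2k degrees of freedom."""
    x2 = -2.0 * sum(math.log(max(p, _TINY)) for p in pvalues)
    return chi2_sf_even_df(x2, 2 * len(pvalues))


def protein_pvalue(z1, z2, theta):
    """Step 3: one-sided p-values of the rotated z-scores, combined per protein."""
    x, y = rotate(z1, z2, theta)
    return fisher_combine([1.0 - _STD_NORMAL.cdf(x), 1.0 - _STD_NORMAL.cdf(y)])


def pathway_pvalue(zscores, theta):
    """Step 4: combine the per-protein p-values of one pathway."""
    return fisher_combine([protein_pvalue(z1, z2, theta) for z1, z2 in zscores])


# Invented z-scores for a pathway that goes up in treatment 1 and down in treatment 2.
pathway = [(2.0, -2.0), (1.5, -1.8), (2.2, -1.1)]
p_matching = pathway_pvalue(pathway, math.pi / 2)  # direction that matches the data
p_other = pathway_pvalue(pathway, 0.0)             # both-up direction
```

With these invented values the pathway scores a much smaller p-value in the matching direction than in the both-up direction, which is the behaviour the four-direction test relies on.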

Functional annotation of “missing” proteins - a bioinformatics approach

Shoba Ranganathan1,4, Javed M. Khan1, David E. James2 & Mark S. Baker3

1Department of Chemistry and Biomolecular Sciences and ARC Centre of Excellence in Bioinformatics, Macquarie University, Sydney, NSW, Australia 2Diabetes and Obesity Program, Garvan Institute of Medical Research, Sydney, NSW, Australia

3Department of Chemistry and Biomolecular Sciences, Macquarie University, Sydney, NSW, Australia 4Department of Biochemistry, Yong Loo Lin School of Medicine, National University of Singapore, Singapore

Background Results for Chromosome 7

Future directions

Introduction: The Chromosome-centric Human Proteome Project (C-HPP) aims to systematically map all human proteins. This mapping will lead to a knowledge-based resource defining the full set of proteins encoded in each chromosome and laying the foundation for the development of a standardized approach to analyse the massive proteomic data sets currently being generated. The neXtProt database lists 20,111 proteins as the complete human proteome. However, several of these proteins have no evidence at the proteomic or structural levels. For example, 203 (28%) proteins of human chromosome 7 are considered “missing” as they lack experimental evidence. We have developed a protocol for the functional annotation of these proteins by integrating several bioinformatics analysis and annotation tools including protein domain mapping, Gene Ontology (GO) mapping, Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis and interactome analysis. Functional annotation of a large percentage of these “missing” proteins will be presented and these annotations can be visualized using the ProteomeBrowser. This prototype generic methodology can be extended to functionally annotate “missing proteins” from any species.

Aims: This project aims to functionally annotate the “missing proteins” in the human proteome, initially focusing on chromosome 7 (Chr7):

• Identification of the list of Chr7 missing proteins

• Homology mapping of full length proteins and protein domains against near neighbor species

• Categorization of the types of protein families represented in the list of missing proteins, with a focus on membrane proteins

• Assembling a list of key missing proteotypic (e.g. tryptic peptides) and familiotypic peptides (e.g. domain and class specific peptides); and

• Obtaining high quality functional annotations for these “missing” proteins, thereby contributing to the development of the ProteomeBrowser visualization tool.

Figure 2: BLASTP results show 170 hits against reviewed non-human mammalian proteins, with 17 hits ≥ 98%, 37 ≥ 95%, 55 ≥ 90% sequence identity and 33 proteins with no matches. The top three species matches were Mouse, Rat and Bovine with 94, 18 and 16 hits, respectively.

Figure 1: A flowchart of the methodology used in our bioinformatics analyses. Phase I represents data collection and pre-processing. Phase II depicts the various analyses carried out to obtain high quality annotations for the missing proteins from each chromosome.

Figure 3. A (above): In silico tryptic digestion of the 33 missing proteins resulted in 586 peptides for further analysis to identify proteotypic and familiotypic peptides. B (below): InterProScan identified protein domain profiles for 167 of the 203 proteins for in-depth analysis and cross referencing with BLASTP results.
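The in silico tryptic digestion mentioned in Figure 3 can be illustrated with the commonly used trypsin rule: cleave on the C-terminal side of lysine (K) or arginine (R) unless the next residue is proline (P). The sketch below is an assumption about the rule applied, since the poster does not name the tool or its settings (missed cleavages, length filters); `tryptic_digest` is a hypothetical name and the example sequence is invented.

```python
def tryptic_digest(sequence, min_length=1):
    """Split a protein sequence after K or R, except when the next residue is P."""
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # C-terminal peptide without a cleavage site
    return [p for p in peptides if len(p) >= min_length]


# Invented example sequence: the R at position 6 is followed by P, so it is not cleaved.
peptides = tryptic_digest("MAKGLRPEEKTT")
```

A `min_length` filter is included because very short peptides are rarely useful as proteotypic candidates; the threshold to use is left to the reader.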

Figure 5: Blast2GO results for the 203 proteins reveal that many of these proteins are involved in binding (90), are localized within the cell (111) and take part in biological processes such as cellular process (92), biological regulation (68) and response to stimulus (61).

Figure 4 (below): Results for BLASTP run against the PDB. This figure shows the top 10 hits sorted by the highest identity and alignment length. At least two hits have a high % identity (≥ 98%) and alignment length. In all, 84 hits were retrieved. Further investigation of the query coverage will help assess prospects for homology modeling and other structural studies.

Figure 6: KEGG pathway analysis using KOBAS resulted in 178 matches, of which 83 and 88 proteins were identified with the most enriched pathway and disease terms, respectively. This figure shows the top 7 pathways by number of proteins; Signal transduction and Olfactory transduction are the top two, with 16 and 13 proteins, respectively.

Methods

Cellular component Molecular function

Biological process

1. In-depth analyses by cross referencing of all the results.

2. Identification of potentially proteotypic and familiotypic peptides from the in silico tryptic digestion results for the 33 missing proteins, for mAb production in collaboration with Monash Antibody Technologies Facility (MATF) and proteomic identification from MS data.

3. Development of a generic semi-automated annotation pipeline for application to other human chromosomes as well as proteomes of other organisms.

Acknowledgement

We are grateful to Mr. Gagan Garg, Macquarie University, for valuable bioinformatics assistance and the ANZ Chr7 C-HPP team for their support and suggestions.

Cloudbusting: Fast Annotation Done Cheap

Rupert Shuttleworth

Victor Chang Cardiac Research Institute and Faculty of Engineering, University of New South Wales

At the Epigenetics Lab at VCCRI we have been experimenting with cloud computing to see if we can speed up our annotation pipeline. So far we are able to annotate about 150,000 reads per minute, and it costs us less than a dollar to annotate a million reads this way.

There is an extra time overhead for uploading and downloading data between the remote computers, but by using compression and parallel connections we are able to upload a million reads in about a minute (and download several times faster).

It is early days, and there are more opportunities for lower costs and higher speeds that we have not tried yet. But the forecast is looking good so far.

Servers  Workers  Annotation time (min)  EC2/EMR cost ($)
1        2        201                    2.72
1        5        67                     1.84
1        10       39                     1.32
1        20       30                     2.12
1        39       27                     3.64
2        20       26                     2.64
2        38       25                     4.08

[Figure: workflow diagram.
Begin: SAM file → compress → compressed SAM file → split → compressed SAM parts 1 … N.
Send data to cloud: upload the compressed parts to Amazon S3.
Annotate data in cloud: on the Hadoop cluster (Amazon EMR/EC2), each part is decompressed, annotated and compressed again.
Download data from cloud: download the compressed annotated parts, uncompress each part and merge them into the annotated SAM file.
End.]
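The split, annotate and merge pattern of the workflow can be sketched locally in Python. This is only an illustration of the data flow under stated assumptions: a thread pool stands in for the Hadoop workers, no Amazon S3 or EMR calls are made, the file is split by lines before compression (the diagram compresses first), and `annotate_part` with its `XX:Z:demo` tag is a hypothetical placeholder for the real annotation step.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor


def split_and_compress(lines, n_parts):
    """Split SAM lines into at most n_parts chunks and gzip each chunk."""
    chunk = max(1, -(-len(lines) // n_parts))  # ceiling division, at least one line
    parts = [lines[i:i + chunk] for i in range(0, len(lines), chunk)]
    return [gzip.compress("".join(part).encode("utf-8")) for part in parts]


def annotate_part(blob):
    """Hypothetical worker step: decompress, tag each alignment line, recompress."""
    text = gzip.decompress(blob).decode("utf-8")
    out = []
    for line in text.splitlines():
        if line.startswith("@"):
            out.append(line + "\n")  # header lines pass through unchanged
        else:
            out.append(line + "\tXX:Z:demo\n")
    return gzip.compress("".join(out).encode("utf-8"))


def decompress_and_merge(blobs):
    """Uncompress the annotated parts and concatenate them in their original order."""
    return "".join(gzip.decompress(blob).decode("utf-8") for blob in blobs)


def run_pipeline(lines, n_parts=4):
    parts = split_and_compress(lines, n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        annotated = list(pool.map(annotate_part, parts))  # map() preserves part order
    return decompress_and_merge(annotated)


# Tiny invented SAM-like input: one header line and three alignment lines.
sam_lines = [
    "@HD\tVN:1.0\n",
    "read1\t0\tchr1\t100\n",
    "read2\t0\tchr1\t200\n",
    "read3\t16\tchr2\t300\n",
]
annotated_sam = run_pipeline(sam_lines, n_parts=2)
```

Because the parts are independent, the number of workers can be varied without changing the result, which is the property the timing table explores.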

OCAP Pipeline

We have developed an open comprehensive analysis pipeline for iTRAQ (OCAP) to facilitate the data analysis. The OCAP pipeline integrates a number of our new algorithms, and includes all the analysis components:

(1) a peak identification algorithm (DyWave) [1];

(2) an adapted protein identification algorithm (X!Tandem) [2];

(3) a protein quantification algorithm (WQuant);

(4) a suite of visualisation tools for quality control and exploratory analysis of the raw data.

DyWave utilises a dynamic wavelet-based peak identification algorithm which simplifies the peak identification process and improves the accuracy; see Figure 2 for details. The incorporated X!Tandem protein identification algorithm has been shown to outperform SEQUEST and MASCOT on some datasets [3]. WQuant achieves protein quantification by dynamically identifying and extracting iTRAQ signals from noise in a data-driven fashion.

OCAP Pipeline And A New Hybrid Protein Identification Method

Penghao Wang1,2, Jean Yang1, Susan R. Wilson2,3,4

1. School of Mathematics and Statistics, University of Sydney, Australia

2. Prince of Wales Clinical School, University of New South Wales, Australia

3. School of Mathematics and Statistics, University of New South Wales, Australia

4. Mathematical Sciences Institute, Australian National University, Australia

Introduction

Tandem mass spectrometry-based iTRAQ protein quantification enables the determination of relative expression levels for thousands of proteins simultaneously. This provides a powerful means for identifying disease biomarkers.

iTRAQ data analysis is complicated and involves several analytical stages. Unfortunately, very few comprehensive analysis pipelines are available. Existing pipelines consist of tools that are usually designed and developed separately. This makes the analysis procedure cumbersome and may lead to sub-optimal results.

Acknowledgements

The OCAP pipeline was funded by Australian Research Council (ARC) Project DP094267.

NovoDB was funded by National Health and Medical Research Council (NHMRC) grant 525453.

NovoDB – A Hybrid Protein Identification Method

De novo sequencing based protein identification is the only feasible approach for finding new proteins and is an effective method for studying protein post-translational modifications. To further increase the protein identification accuracy and coverage, we have recently developed a new hybrid protein identification method, NovoDB [5].

NovoDB differs from existing de novo sequencing methods, which rely on finding one maximum path from a constructed spectrum graph. NovoDB applies a novel Bayesian network and dynamic programming hybrid algorithm to explore the sub-optimal solution space. Thus NovoDB can better accommodate various interferences and artefacts present in the spectra. Evaluated on a large number of spectra, NovoDB outperforms the most popular de novo sequencing methods and can improve the accuracy of de novo sequencing-based protein identification; see Figure 6. We are currently working on extending NovoDB to identify protein modifications and incorporating it into OCAP.
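The difference between taking one maximum path and exploring sub-optimal solutions can be illustrated with a small dynamic programme that keeps the k best-scoring paths through a directed acyclic graph. This is a generic sketch, not NovoDB's Bayesian network and dynamic programming hybrid; the toy graph, its edge scores and the function name `k_best_paths` are hypothetical.

```python
import heapq

def k_best_paths(n_nodes, edges, k):
    """Return up to k highest-scoring paths from node 0 to node n_nodes - 1.

    edges: list of (u, v, score, label) with u < v, i.e. nodes are already
    in topological order (as mass-ordered nodes of a spectrum graph would be).
    Each result is a (total_score, concatenated_labels) tuple.
    """
    incoming = [[] for _ in range(n_nodes)]
    for u, v, score, label in edges:
        incoming[v].append((u, score, label))

    # best[v] holds the k best partial paths that end at node v.
    best = [[] for _ in range(n_nodes)]
    best[0] = [(0.0, "")]
    for v in range(1, n_nodes):
        candidates = [
            (partial_score + score, partial_seq + label)
            for u, score, label in incoming[v]
            for partial_score, partial_seq in best[u]
        ]
        best[v] = heapq.nlargest(k, candidates)
    return best[-1]

# Toy graph (hypothetical scores); edge labels stand in for amino acid residues.
toy_edges = [
    (0, 1, 2.0, "A"),
    (1, 3, 2.0, "K"),
    (0, 2, 1.0, "S"),
    (2, 3, 2.5, "R"),
    (1, 2, 0.5, "G"),
]
top_two = k_best_paths(4, toy_edges, k=2)
print(top_two)
```

With k=1 this reduces to the single maximum path; larger k exposes near-optimal alternatives that a downstream scoring step could re-rank.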

References

[1] Wang P., et al. (2010) Bioinformatics, 26(18): 2242-2249.

[2] Craig R. and Beavis R. (2004) Bioinformatics, 20(9): 1466-1467.

[3] Balgley B.M., et al. (2007) Mol. Cell Proteomics, 6: 1599-1608.

[4] Wang P., et al. (2012) Bioinformatics, 28(10): 1404-1405.

[5] Wang P., et al. (2012) BIOCOMP12, pp. 74-81.

[6] Keller A., et al. (2005) Mol. Syst. Bio., Epub2005.

OCAP is offered as both a standalone system and an R package. The R version of OCAP provides a convenient interface for downstream statistical analysis that may significantly facilitate the iTRAQ data analysis [4].

OCAP is able to generate results in either a fully automatic or a stepwise manner. Under the automatic mode, OCAP directly produces peptide and protein level identification and quantification results through a single function, which greatly facilitates the data analysis. The analysis can also be completed separately for each component, enabling users to either perform separate analysis on intermediate results or export the results to other statistical software. The diagrammatic view of OCAP and its main functionalities are given in Figure 3.

OCAP provides a series of visualisation tools, and some examples are given in Figures 4 and 5.

Prince of Wales Clinical School

Figure 1. The typical workflow of iTRAQ data analysis, including 3 main components: pre-processing; protein ID; and protein quantification.

Figure 2. Using Continuous Wavelet Transform, the peak identification procedure can be simplified compared with the traditional peak identification procedure.

Figure 3. Overview of the OCAP pipeline and its main analytical functionalities.

Figure 4. Peptide expression image plot for a specific protein.

Figure 5. Protein ID coverage and confidence plot. Identified peptides are marked in colour; all others are marked in black.

Figure 6. The performance of NovoDB. The x-axis is the identified peptide length in number of amino acids; the y-axis is the identification accuracy.

Conclusion

Our OCAP pipeline can greatly facilitate the data analysis for iTRAQ. Based on our results, OCAP performs favourably compared with the TPP pipeline [6]. However, the adapted X!Tandem protein identification algorithm can be improved. Therefore we have developed a hybrid protein identification method to address the issue. The hybrid method introduces a two-step identification framework. Using the framework, we are currently developing new methods that can detect protein modifications simultaneously with the protein identification process, so that the identification coverage can be significantly improved.


Aim

Genomic analysis

Integration by pathway model

Integrate Genomic and Functional data

Pathway

Our group aims to apply computational and statistical methods to model biological and clinical questions related to cancer biology and translational research. In collaboration with the Pancreatic Cancer and Signal Transduction Groups at Garvan/TKCC, and Prof. Grimmond's group at the Queensland Centre for Medical Genomics, we are undertaking integrative analysis of multidimensional “-omics” datasets, generated by deep sequencing of the cancer genome and profiling of the transcriptome (the set of all messenger RNA molecules), epigenome (the heritable changes regulating gene expression without altering the underlying DNA sequence) and proteome (the entire set of proteins expressed or phosphorylated), with the aim of identifying candidate driver mutations and pathway aberrations in pancreatic cancer.

Cancer Bioinformatics

Integrative analysis of multiple –omics data for Pancreatic Cancer

Mark Cowley, Mark Pinese, Emily Stoddart, Roger Daly, Andrew Biankin, Jianmin Wu

The Kinghorn Cancer Centre & Cancer Research Program, Garvan Institute of Medical Research, Sydney

Proteomic analysis

[Figure: dendrogram of pancreatic cancer cell lines showing the most frequent clustering result (83 times); hclust, complete linkage; y-axis: height.]

Acknowledgements: We thank the Australian Pancreatic Genome Initiative (APGI) and all participating clinicians for their support and the high-quality samples used in this study.

Analysis platforms we developed

[Figure: heatmap of relative values (colour scale −2 to 2). Columns are cancer cell lines grouped by category (Ovarian, Rhabdomyosarcoma, Colon, Meningioma, Lung NSCLC, Pancreas, Breast, Endometrial, Renal Cell Carcinoma, GBM, Esophageal, Lung SCLC, Bladder, Melanoma, Liver, Osteosarcoma, Gastric, Multiple Myeloma, Leukemia, Colon or Melanoma); rows are shRNA constructs labelled by TRCN identifier and target gene.]


[Figure: box plots of intensities for 22 pancreatic cancer cell lines (AsPC-1, BxPC3, Capan-1, Capan-2, CFPAC-1, HPAC, HPAF-II, HPDE, HPDE_KrasV12D, Hs700T, Hs766T, MiaPaca-2, Panc-1, PL45, SU8686, SW1990, X-0203, X-0327, X-0403, X-0504, X-0813, X-1005); dataset label: Orbi_pYdataset_normalized_imputed_complete_log2.]

The left figure is a word cloud of the genes with somatic mutations identified by deep sequencing of tumor and normal tissues from 99 pancreatic cancer patients. Text size corresponds to the frequency of mutations in the patient cohort. The right figure summarises copy number changes, differential gene expression, somatic coding mutations and structural aberrations for a pancreatic cancer.

The top left figure shows the boxplot of intensities for 22 pancreatic cancer cell lines. The top right figure is the most stable phenotypic subtype using a stepwise hierarchical clustering approach we devised. The left figure lists the top 30 important phosphorylation sites ranked by the variable importance score from a Random Forest model we built using drug response data. The right figure shows the protein-protein interactions among the proteins with these important phosphorylation sites.

By integrating somatic mutations and copy number variations, we found that Axon Guidance pathway genes are enriched in the APGI cohort. In addition, by combining expression profiling and patient outcome data, we found that low expression of the ROBO2 receptor was associated with poor patient survival, while high expression of ROBO3, a known inhibitor of ROBO2 signalling, demonstrated an appropriate reciprocal inverse association with poor survival.

To investigate the potential functional consequences of mutations and copy number variant genes in our cohort, we integrated our dataset with a large-scale in vitro shRNA functional screen (data from Cheung et al., 2011) and two recently published in vivo Sleeping Beauty transposon-mediated insertional mutagenesis screens (Mann et al. and Perez-Mancera et al.). The right figure is a heatmap of the highest confidence cancer-lineage-specific oncogenes.

Our group developed and published several popular analysis platforms. The Protein Interaction Network Analysis (PINA) platform (left figure) is an integrated platform for protein interaction network construction, filtering, analysis, visualization and management. It has been published in Nature Methods and Nucleic Acids Research and has users all over the world. The InterOmics platform (right figure) is a platform developed for integrating multi "-omics" data, and has been used in the ICGC Pancreatic Cancer project.

This work is a part of the International Cancer Genome Consortium (ICGC) Pancreatic Cancer project, led by the Australian Pancreatic Cancer Genome Initiative (APGI).

Re-Fraction: a machine learning R package for deterministic identification of protein homologues and splice variants in large-scale MS-based proteomics

Background

Methods

Availability

Aims

Results

Bottom-up mass spectrometry (MS)-based proteomics relies on identifying enzymatically digested peptides and then inferring which proteins could be present in the sample from the observed peptide sequences. This process is known as “protein inference”. A key challenge in protein inference is that a high percentage of identified peptides are shared among multiple proteins. This results in ambiguity in determining the exact identity of proteins present in the sample.

In particular, shared peptides are especially common among protein homologues and splice variants, making deterministic identification of these proteins a nontrivial task.
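The ambiguity described above can be made concrete with a small sketch that separates unique peptides from shared ones given a protein-to-peptide mapping. The protein names and peptide strings below are made up for illustration, and the function name `split_unique_and_shared` is hypothetical; it is not part of Re-Fraction or any search engine.

```python
from collections import defaultdict

def split_unique_and_shared(protein_to_peptides):
    """Split peptides into those matching exactly one protein and those matching several.

    Returns (unique, shared) where
      unique maps peptide -> its single parent protein, and
      shared maps peptide -> sorted list of candidate parent proteins.
    """
    peptide_to_proteins = defaultdict(set)
    for protein, peptides in protein_to_peptides.items():
        for peptide in peptides:
            peptide_to_proteins[peptide].add(protein)

    unique = {}
    shared = {}
    for peptide, proteins in peptide_to_proteins.items():
        if len(proteins) == 1:
            unique[peptide] = next(iter(proteins))
        else:
            shared[peptide] = sorted(proteins)
    return unique, shared

# Hypothetical database: a protein and a close homologue sharing one peptide.
database = {
    "protein_1": {"LVEADK", "GGSTIR"},
    "protein_1_homologue": {"LVEADK", "AQWDK"},
}
unique, shared = split_unique_and_shared(database)
print(unique)
print(shared)
```

Only the peptides in `unique` identify a protein deterministically; every peptide in `shared` leaves a group of candidate proteins that protein inference must resolve.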

Design a computational approach to resolve ambiguity in peptide assignment and therefore accurately distinguish proteins and their homologues and/or splice variants expressed in the sample.

In proteomics studies, the sample complexity can be reduced by fractionating proteins using SDS-PAGE prior to LC-MS/MS analysis.

[Figure: box plots across gel fractions F1 to F10 of four protein properties: protein mass (kDa), protein length (aa), log2 number of tryptic peptides, and isoelectric point.]

[Figure: SVM regression with respect to gel fractions using protein mass (x-axis: protein mass in kDa; y-axis: regression value), and using three features combined (mass, pI, peptide). Colour corresponds to the actual gel fraction in which a protein is found.]

By applying Re-Fraction to a large-scale proteomics dataset generated from the 3T3-L1 plasma membrane proteome study in our lab, we showed that the algorithm can accurately assign each protein to its corresponding fraction.

Firstly, we evaluated the performance of the model using 10-fold cross validation on the training data where the peptides are uniquely assigned to a single protein by using MaxQuant software. We calculated accuracy (ACC), sensitivity (SE) and specificity (SP) as follows:

[Figure: SDS-PAGE gel schematic separating “big” proteins from “small” proteins.]

By utilising four protein physical properties as learning features, we can model in which gel fraction each protein should be found. The following shows how well each feature alone captures the separation of proteins into gel fractions.
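As an illustration of how such features can be derived from a sequence, the sketch below computes protein length, approximate average mass and the number of fully tryptic peptides. It is not the `extractDatabase` implementation from the package: the function names are hypothetical, the residue masses are standard approximate average values, the cleavage rule (after K or R, not before P) is the usual trypsin convention, and the isoelectric point is omitted for brevity.

```python
import re

# Approximate average residue masses in Da (standard textbook values).
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528  # one water per chain for the free termini

def protein_mass_kda(sequence):
    """Average molecular mass of a protein sequence in kDa."""
    residues = sum(AVERAGE_RESIDUE_MASS[aa] for aa in sequence)
    return (residues + WATER_MASS) / 1000.0

def tryptic_peptides(sequence):
    """Fully cleaved tryptic peptides: cut after K or R unless the next residue is P."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]

def protein_features(sequence):
    """Collect three of the four learning features named in the text."""
    return {
        "length": len(sequence),
        "mass_kda": protein_mass_kda(sequence),
        "n_tryptic_peptides": len(tryptic_peptides(sequence)),
    }

# Made-up sequence used only to exercise the functions.
features = protein_features("MKTAYIAKQRPLK")
print(features)
```

Zero-width splitting with `re.split` requires Python 3.7 or later.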


A support vector machine (SVM) regression model is applied to build a classifier using the four features described above. This is then followed by assigning each protein from a given protein database to its expected gel fraction.

The following figure shows the regression result using protein mass alone and using the combination of three features, respectively. By combining more features, we can improve the model performance.

Since the fraction from which a peptide was identified is known, this information can be used to prevent the peptide from being assigned to unlikely or incorrect proteins based on their physical properties, even if all putative proteins in the protein group contain the same observed peptide sequences. We call this procedure “Re-Fraction”.
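The filtering idea can be sketched as follows: given the gel fraction in which a shared peptide was observed and a predicted fraction for each candidate protein, discard candidates whose predicted fraction is far from the observed one. This is a simplified illustration rather than the package's code; the function name, the `tolerance` parameter and the predicted fractions are assumptions made for the example.

```python
def refraction_filter(observed_fraction, candidates, predicted_fraction, tolerance=1):
    """Keep candidate proteins whose predicted gel fraction is within
    `tolerance` fractions of the fraction where the peptide was observed."""
    return [
        protein for protein in candidates
        if abs(predicted_fraction[protein] - observed_fraction) <= tolerance
    ]

# Hypothetical predictions from a regression model (fraction numbers 1 to 10).
predicted = {"large_protein": 2, "small_homologue": 8}

# A peptide shared by both proteins, observed in gel fraction 8.
kept = refraction_filter(8, ["large_protein", "small_homologue"], predicted)
print(kept)
```

When exactly one candidate survives, the shared peptide can be treated as unique to that protein; when several survive, the ambiguity remains.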

Fraction  ACC    SE     SP
F1        0.981  0.824  0.999
F2        0.988  0.936  0.994
F3        0.986  0.944  0.992
F4        0.967  0.937  0.975
F5        0.955  0.863  0.968
F6        0.957  0.801  0.978
F7        0.946  0.753  0.968
F8        0.952  0.628  0.975
F9        0.964  0.722  0.975
F10       0.981  0.737  0.996

After applying Re-Fraction, on the peptide level, we assign 2424 more unique peptides to their corresponding proteins. This accounts for a 16% increase compared to the original result without applying Re-Fraction.

On the protein level, we deterministically identify 256 more proteins, which are roughly equally distributed across fractions in percentage terms.

[Figure: number of deterministic protein identifications per gel fraction (F1 to F10), original versus additional identifications from Re-Fraction; in total, 693 (original) versus 949 (Re-Fraction).]

As a validation, Re-Fraction was able to determine that RagA, but not RagB, is expressed in 3T3-L1 cells. Using immunoblotting, we confirmed this finding.

We have implemented Re-Fraction as an R package, and it can be applied to any proteomics data with SDS-PAGE fractionation. The project homepage address is:

http://code.google.com/p/re-fraction/

The project homepage contains the Re-Fraction R package, the source code and the test datasets. Please follow the steps below to install and use the Re-Fraction R package:

Pengyi Yang1,2, Sean Humphrey1, Daniel Fazakerley1, Matthew Prior2, Guang Yang1, David James1, Jean Yang2

1 Garvan Institute of Medical Research, Sydney, Australia; 2 School of Mathematics and Statistics, University of Sydney.

$$\mathrm{SE} = \frac{TP}{TP + FN}, \qquad \mathrm{SP} = \frac{TN}{TN + FP}, \qquad \mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$

where TP, TN, FP, and FN are true positives, true negatives, false positives, and false negatives, respectively.
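These three definitions translate directly into code. The sketch below is a minimal helper under the assumption that each gel fraction is evaluated one-vs-rest; the function name and the example counts are hypothetical and are not taken from the poster's results.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return (accuracy, sensitivity, specificity) from confusion-matrix counts."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one observation is required")
    accuracy = (tp + tn) / total
    # Guard against empty positive or negative classes.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return accuracy, sensitivity, specificity

# Hypothetical counts for one fraction treated as the positive class.
acc, se, sp = classification_metrics(tp=80, tn=900, fp=10, fn=10)
print(f"ACC={acc:.3f} SE={se:.3f} SP={sp:.3f}")
```

Because each fraction contains only a small share of all proteins, specificity and accuracy are dominated by the many true negatives, which is why sensitivity is the more informative column when comparing fractions.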


[Figure: number of unique and shared peptides before and after Re-Fraction (unique: 5177 versus 7601; shared: 9742 versus 7318), and percentage of unique peptides (35% versus 51%).]

(1) After downloading the package (current version: ReFraction_0.2.tar.gz), install it from the console as follows:

R CMD INSTALL ReFraction_0.2.tar.gz

(2) Open an R window and load the package as follows:

library(ReFraction)

(3) Type the following to see the general information:

?ReFraction

(4) Type the following to see the application example:

?applyReFraction

(5) Extract protein properties from a fasta file:

extractDatabase(path to the fasta file)

Please visit the project homepage for more usage details and examples.
