Getting genomics and proteomics data to work together - Jason Wong
australian-bioinformatics-network -
![Page 1: Getting genomics and proteomics data to work together - Jason Wong](https://reader033.fdocuments.in/reader033/viewer/2022042816/559b47ca1a28abaf2e8b4671/html5/thumbnails/1.jpg)
Getting genomics and proteomics
data to work together
Prince of Wales Clinical School
Dr Jason Wong
![Page 2: Getting genomics and proteomics data to work together - Jason Wong](https://reader033.fdocuments.in/reader033/viewer/2022042816/559b47ca1a28abaf2e8b4671/html5/thumbnails/2.jpg)
State of the art in proteomics
Proteomics can now be used to identify and quantify tens of thousands
of proteins in a single experiment.
Nagaraj et al Mol Sys Biol 2011
HeLa cells: 10,255 proteins identified
Zhou et al Nat. Comm. 2013
mESC: 11,352 proteins identified
Mertins et al Nat. Met. 2013
Jurkat cells: 7,897 proteins
![Page 3: Getting genomics and proteomics data to work together - Jason Wong](https://reader033.fdocuments.in/reader033/viewer/2022042816/559b47ca1a28abaf2e8b4671/html5/thumbnails/3.jpg)
Challenges of proteomics
• Experimental perspective
• Obtaining sufficient sample
• Sample preparation
• Dynamic range
• Computational perspective
• Risk of false positive identification
• General methods only identify known proteins
Wong et al BMC Bioinf 2007
• Annotated spectra (~30%)
• High quality, potentially annotatable spectra (~20%)
• Non-peptide/low quality spectra (~50%)
![Page 4: Getting genomics and proteomics data to work together - Jason Wong](https://reader033.fdocuments.in/reader033/viewer/2022042816/559b47ca1a28abaf2e8b4671/html5/thumbnails/4.jpg)
Genomics and transcriptomics
Analysis of DNA/RNA does not have many of the limitations of proteomics,
especially with the emergence of next-generation sequencing (NGS).
• Sample quantity less of an issue when analysing DNA/RNA.
• Very large dynamic range with NGS.
• Relatively simple sequence-based data analysis.
Next-generation sequencing has allowed the discovery of:
1. Single nucleotide variants/Indels (Exome-seq/RNA-seq)
2. Novel splice variants (RNA-seq)
3. Novel proteins (Ribosome profiling)
However, in order to understand the functional importance of coding
genes, it is still essential to study them at the protein level.
![Page 5: Getting genomics and proteomics data to work together - Jason Wong](https://reader033.fdocuments.in/reader033/viewer/2022042816/559b47ca1a28abaf2e8b4671/html5/thumbnails/5.jpg)
Single nucleotide variants – Jurkat cells
Datasets

| Experiment | Details | Reference |
| --- | --- | --- |
| Exome-seq | ~150 M, 100 bp PE reads | Broad Institute, CCLE |
| RNA-seq | ~100 M, 100 bp PE reads | Sheynkman et al MCP (2013) |
| Proteomics deep | ~0.5 M spectra | Sheynkman et al MCP (2013) |
| Proteomics ultra-deep | ~2.5 M spectra | Mertins et al Nat Meth (2013) |
| Proteomics ultra-deep PTM | pSTY: ~0.85 M spectra; (ac)K: ~0.35 M spectra; (ubi)K: ~0.36 M spectra | Mertins et al Nat Meth (2013) |
![Page 6: Getting genomics and proteomics data to work together - Jason Wong](https://reader033.fdocuments.in/reader033/viewer/2022042816/559b47ca1a28abaf2e8b4671/html5/thumbnails/6.jpg)
Searching for peptides with SNVs
Genomics: Exome/RNA-seq → BAM → (GATK/samtools) → VCF → (ANNOVAR) → Annotated variants → (Python scripts) → Variant peptides
Proteomics: Mass spectra → (search using MaxQuant against RefSeq + variant peptides) → Annotated mass spectra
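The slides do not show the Python scripts that turn annotated variants into variant peptides. As a rough illustration only, the step could look like the sketch below. It assumes a simplified trypsin rule (cleave after K or R unless followed by P, no missed cleavages) and a single amino-acid substitution; the function names and the toy protein are hypothetical and are not taken from the talk.

```python
import re

def tryptic_peptides(protein, min_len=6):
    """Simplified in-silico tryptic digest (assumed rule, no missed cleavages)."""
    peptides = []
    start = 0
    # Cleave after K or R, except when the next residue is P.
    for match in re.finditer(r"[KR](?!P)", protein):
        peptides.append(protein[start:match.end()])
        start = match.end()
    if start < len(protein):
        peptides.append(protein[start:])
    return [p for p in peptides if len(p) >= min_len]

def variant_peptides(protein, pos, ref, alt, min_len=6):
    """Peptides of the mutated protein that are absent from the reference digest.

    pos is the 1-based residue position of the substitution.
    """
    if protein[pos - 1] != ref:
        raise ValueError(f"expected {ref} at position {pos}, found {protein[pos - 1]}")
    mutated = protein[:pos - 1] + alt + protein[pos:]
    reference = set(tryptic_peptides(protein, min_len))
    return [p for p in tryptic_peptides(mutated, min_len) if p not in reference]

# Toy protein, made up for illustration.
toy_protein = "MAGSLKEILSPQWYRTTDEKAAAPLLK"
print(variant_peptides(toy_protein, 13, "W", "C"))
```

Only the peptides that differ from the reference digest need to be appended to the RefSeq search database, which keeps the added search space small.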
![Page 7: Getting genomics and proteomics data to work together - Jason Wong](https://reader033.fdocuments.in/reader033/viewer/2022042816/559b47ca1a28abaf2e8b4671/html5/thumbnails/7.jpg)
Variants – Overlap between Exome- and RNA-seq
Non-synonymous (n.s.) variants:
• Exome-seq only: 8232
• Exome-seq and RNA-seq: 4584
• RNA-seq only: 1975
Almost 70% of RNA-seq n.s. variants overlap with Exome-seq.
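The "almost 70%" figure can be checked from the three counts on this slide. The short sketch below assumes the counts are, in order, Exome-seq only, shared, and RNA-seq only, which is consistent with the totals quoted on the next slide (12816 and 6559). The variable names are illustrative.

```python
# Counts read off the slide (assumed order: exome only, shared, RNA-seq only).
exome_only, shared, rnaseq_only = 8232, 4584, 1975

exome_total = exome_only + shared    # all Exome-seq n.s. variants
rnaseq_total = shared + rnaseq_only  # all RNA-seq n.s. variants

frac_rnaseq_in_exome = shared / rnaseq_total
frac_exome_in_rnaseq = shared / exome_total

print(f"Exome-seq total: {exome_total}, RNA-seq total: {rnaseq_total}")
print(f"{frac_rnaseq_in_exome:.1%} of RNA-seq n.s. variants are also in Exome-seq")
print(f"{frac_exome_in_rnaseq:.1%} of Exome-seq n.s. variants are also in RNA-seq")
```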
![Page 8: Getting genomics and proteomics data to work together - Jason Wong](https://reader033.fdocuments.in/reader033/viewer/2022042816/559b47ca1a28abaf2e8b4671/html5/thumbnails/8.jpg)
Variants – Overlap with proteomics data
Variants identified in the proteomics datasets:

| Variant source | Total variants | Mertins dataset only | Both datasets | Sheynkman dataset only |
| --- | --- | --- | --- | --- |
| RNA-seq | 6559 | 638 | 349 | 99 |
| Exome-seq | 12816 | 525 | 290 | 81 |

• Suggests that RNA-seq may be more suited for finding variants in proteomics data.
• However, this may also just be due to data quality issues.

RNA-seq based variants validated by mass spectrometry:

| Dataset | Variant peptides | Total peptides |
| --- | --- | --- |
| Mertins | 987 | 156,606 |
| Sheynkman | 448 | 75,878 |
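One way to make the RNA-seq versus Exome-seq comparison concrete is to compute the fraction of called variants that were seen in at least one proteomics dataset. This is a sketch under an assumption: the three counts given for each variant source are read as Mertins only, both datasets, and Sheynkman only, which matches the Mertins (987) and Sheynkman (448) totals for RNA-seq.

```python
# (Mertins only, both datasets, Sheynkman only), as read from the slide.
detected = {
    "RNA-seq": (638, 349, 99),
    "Exome-seq": (525, 290, 81),
}
total_variants = {"RNA-seq": 6559, "Exome-seq": 12816}

detection_rate = {}
for source, counts in detected.items():
    seen_by_ms = sum(counts)  # variants seen in at least one dataset
    detection_rate[source] = seen_by_ms / total_variants[source]
    print(f"{source}: {seen_by_ms} of {total_variants[source]} variants "
          f"({detection_rate[source]:.1%}) seen by mass spectrometry")
```

The RNA-seq call set is smaller but a larger share of it is confirmed at the peptide level, which is the observation behind the first bullet above.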
![Page 9: Getting genomics and proteomics data to work together - Jason Wong](https://reader033.fdocuments.in/reader033/viewer/2022042816/559b47ca1a28abaf2e8b4671/html5/thumbnails/9.jpg)
Validation of peptide identifications
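| Zygosity | Variant peptides | Reference peptides |
| --- | --- | --- |
| Heterozygous | 673 | 465 |
| Homozygous | 314 | 4 |

| Chr | Pos | Ref | Alt | Zygosity | Qual | Depth | Depth Alt | Func.refGene | Gene.refGene |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| chr10 | 51363659 | A | G | 1 | 222 | 222 | 200 | exonic | PARG |
| chr10 | 71906150 | T | C | 1 | 44.8 | 2 | 2 | exonic | TYSND1 |
| chr11 | 67414492 | C | T | 1 | 43.8 | 2 | 2 | exonic | ACY3 |
| chr19 | 3492265 | G | A | 1 | 59 | 3 | 3 | exonic | DOHH |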
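A homozygous variant should not produce a reference peptide, so the four homozygous calls with reference peptides deserve a closer look. Three of them have a read depth of only 2 or 3. The sketch below flags such calls; the thresholds (`min_depth`, `min_alt_fraction`) and the function name are assumptions chosen for illustration, not values from the talk.

```python
# (chrom, pos, ref, alt, qual, depth, depth_alt, gene) copied from the slide.
homozygous_with_reference_peptide = [
    ("chr10", 51363659, "A", "G", 222.0, 222, 200, "PARG"),
    ("chr10", 71906150, "T", "C", 44.8, 2, 2, "TYSND1"),
    ("chr11", 67414492, "C", "T", 43.8, 2, 2, "ACY3"),
    ("chr19", 3492265, "G", "A", 59.0, 3, 3, "DOHH"),
]

def review_call(depth, depth_alt, min_depth=10, min_alt_fraction=0.95):
    """Give a rough reason why a homozygous call may be unreliable."""
    if depth < min_depth:
        return "low depth: zygosity uncertain"
    if depth_alt / depth < min_alt_fraction:
        return "reference reads present: possibly heterozygous"
    return "supported homozygous call"

review = {}
for chrom, pos, ref, alt, qual, depth, depth_alt, gene in homozygous_with_reference_peptide:
    review[gene] = review_call(depth, depth_alt)
    print(f"{gene} ({chrom}:{pos} {ref}>{alt}): {review[gene]}")
```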
[Figure: scatter plot of variant abundance (log10) against reference abundance (log10), both axes 4 to 9, r² = 0.19, Mertins dataset]
![Page 10: Getting genomics and proteomics data to work together - Jason Wong](https://reader033.fdocuments.in/reader033/viewer/2022042816/559b47ca1a28abaf2e8b4671/html5/thumbnails/10.jpg)
Validation of peptide identifications
[Figure: MaxQuant score (y-axis, 0 to 250) for variant peptides, reference peptides and all peptides; differences marked n.s.]
[Figure: % variants (y-axis, 0.0 to 0.5) by read depth bin (x-axis, 0-50 to >1000), comparing variants identified by MS with all variants]
![Page 11: Getting genomics and proteomics data to work together - Jason Wong](https://reader033.fdocuments.in/reader033/viewer/2022042816/559b47ca1a28abaf2e8b4671/html5/thumbnails/11.jpg)
Variants in application to PTMs
| Modification | Variant peptides | Total peptides |
| --- | --- | --- |
| Phosphorylation STY(p) | 357 | 64067 |
| Acetylation K(Ac) | 2 | 5805 |
| Ubiquitination K(GG) | 172 | 38454 |

How does the variant affect phosphorylation?
(1) Variant residue not affecting phosphorylation site: 95% (339), e.g. EILpSPQ(W/C)Y
(2) Variant residue is phosphorylation site: 3.4% (12), e.g. EIL(A/pS)PQWY
(3) Variant residue may influence phosphorylation: 1.6% (6), e.g. EILpS(G/P)QWY
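The three categories can be expressed as a simple rule on the distance between the variant residue and the phosphorylated residue. The sketch below is an assumption about how such a rule might look: the slides do not say how "may influence" was defined, so the `window` of one residue (which covers the G/P example next to pS) is hypothetical.

```python
def classify_variant(variant_pos, phosphosite_pos, window=1):
    """Classify a variant by its distance to a phosphosite on the same peptide.

    Positions are 1-based indices within the peptide. The window size is an
    illustrative assumption, not a value from the talk.
    """
    distance = abs(variant_pos - phosphosite_pos)
    if distance == 0:
        return "variant residue is phosphorylation site"
    if distance <= window:
        return "variant residue may influence phosphorylation"
    return "variant residue not affecting phosphorylation site"

# The slide's example peptide EILSPQWY, phosphorylated at S4.
print(classify_variant(7, 4))  # W/C at position 7
print(classify_variant(4, 4))  # A/S at position 4
print(classify_variant(5, 4))  # G/P at position 5
```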
![Page 12: Getting genomics and proteomics data to work together - Jason Wong](https://reader033.fdocuments.in/reader033/viewer/2022042816/559b47ca1a28abaf2e8b4671/html5/thumbnails/12.jpg)
Phosphorylation sites directly affected by variants
Creation of phospho-site (X → S/T):

| Gene | Refseq | Variant | Peptide | SNP ID | SIFT score | Polyphen score |
| --- | --- | --- | --- | --- | --- | --- |
| GBF1 | NM_001199378 | p.G1690S | GGSPSALWEITWER | rs11191274 | 0.75 | 0.703 |
| LINS | NM_001040616 | p.R680S | EFSLEPPSSPLVLK | rs8451 | 0.36 | 0.023 |
| TADA2A | NM_001166105 | p.P6S | LGSFSNDPSDKPPCR | rs7211875 | 1 | 0 |
| TCF3 | NM_001136139 | p.A8S | MSPVGTDKELSDLLDFSMMFPLPVTNGK | rs147133056 | 0.05 | 0.997 |
| USE1 | NM_018467 | p.L154S | TGVAGSQPVSEKQSAAELDLVLQR | rs414528 | 0.82 | 0 |
| ATP5SL | NM_001167867 | p.N40S | LGAAVAPEGSQKK | rs2231940 | 0.82 | 0.018 |
| TANC1 | NM_001145909 | p.N250S | SGSSLEWNKDGSLR | rs12466551 | 1 | 0 |
| SF3B1 | NM_001005526 | p.A86T | KPGYHTPVALLNDIPQSTEQYDPFAEHRPPK | NA | 0.01 | 0.955 |
| DDX27 | NM_017895 | p.G766S | QYRASPSFEER | rs1130146 | 0.37 | 0.983 |
| FAM114A2 | NM_018691 | p.G122S | AETSLGIPSPSEISTEVK | rs2578377 | 1 | 0 |
| SPDL1 | NM_017785 | p.L586S | SHPILYVSSK | rs3777084 | 0.89 | 0 |
| GPANK1 | NM_033177 | p.A78S | IMKSPAAEAVAEGASGR | NA | 0.72 | 0.005 |

Creation of MAPK motif (S/T-X → S/T-P):

| Gene | Refseq | Variant | Peptide | SNP ID | SIFT score | Polyphen score |
| --- | --- | --- | --- | --- | --- | --- |
| PARG | NM_003631 | p.L138P | LENVSQLSLDKSPTEK | rs4412715 | NA | NA |
| KIAA0586 | NM_001244193 | p.L703P | EASPPPVQTWIK | rs1748986 | 0.22 | 0.001 |
| DMXL2 | NM_001174116 | p.S1288P | FGDTEADSPNAEEAAMQDHSTFK | rs12102203 | 0.21 | 0 |
| GGA2 | NM_015044 | p.A424P | NLLDLLSPQPAPCPLNYVSQK | rs1135045 | 0.48 | 0 |
| PFAS | NM_012393 | p.L621P | NGQGDAPPTPPPTPVDLELEWVLGK | rs11078738 | 1 | 0 |
| ZNF235 | NM_004234 | p.H296P | SPACSTPEKDTSYSSGIPVQQSVR | rs2125579 | 0.02 | 0.001 |
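The two groups on this slide can be told apart from the protein-level change alone: a substitution to S or T creates a potential phospho-acceptor, while a substitution to P directly after an S or T creates an S/T-P motif. The following is a minimal sketch of that logic, assuming simple `p.X123Y` notation; the function names and the `preceding_residue` argument are illustrative and not part of the original analysis.

```python
import re

def parse_protein_change(change):
    """Split a simple substitution such as 'p.A86T' into (ref, position, alt)."""
    match = re.fullmatch(r"p\.([A-Z])(\d+)([A-Z])", change)
    if match is None:
        raise ValueError(f"unsupported protein change: {change}")
    return match.group(1), int(match.group(2)), match.group(3)

def classify_change(change, preceding_residue=None):
    """Classify how a substitution could affect phosphorylation.

    preceding_residue is the amino acid immediately N-terminal of the
    variant position; it is needed to recognise a new S/T-P motif.
    """
    ref, _, alt = parse_protein_change(change)
    if alt in ("S", "T") and ref not in ("S", "T"):
        return "creation of phospho-site"
    if alt == "P" and preceding_residue in ("S", "T"):
        return "creation of S/T-P motif"
    return "other"

print(classify_change("p.A86T"))                          # SF3B1
print(classify_change("p.L138P", preceding_residue="S"))  # PARG, peptide ...KSPTEK
```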
![Page 13: Getting genomics and proteomics data to work together - Jason Wong](https://reader033.fdocuments.in/reader033/viewer/2022042816/559b47ca1a28abaf2e8b4671/html5/thumbnails/13.jpg)
Splicing factor 3B subunit 1 (SF3B1)
• Part of the RNA splicing machinery
• Frequently mutated in myelodysplasia (~20%) and other leukaemias
• A86T has been identified previously in one lung cancer sample from
TCGA.
• SF3B1 is a phosphoprotein and phosphorylation of SF3B1 is known to
be important for the assembly of the RNA splicing machinery.
A86T is located in exon 3
![Page 14: Getting genomics and proteomics data to work together - Jason Wong](https://reader033.fdocuments.in/reader033/viewer/2022042816/559b47ca1a28abaf2e8b4671/html5/thumbnails/14.jpg)
Using mass spectrometry to discover new proteins
Prediction of alternative ORFs from RNA-seq
• 86 alternative ORFs
• 57% non-AUG translation initiation
Computational prediction of alternative ORFs
• Only AUG translation initiation
• 1,259 alternative ORFs!
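For readers unfamiliar with computational ORF prediction, the AUG-only approach amounts to scanning each reading frame of a transcript for a start codon followed by an in-frame stop codon. The sketch below is a generic illustration and not the method used in the cited studies; `find_orfs`, `min_codons` and the toy sequence are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(mrna, min_codons=10, start_codons=("ATG",)):
    """Find ORFs in the three forward frames of a transcript.

    Returns (start, end, frame) tuples with 0-based start and exclusive end
    (stop codon included). For each stop codon, only the first upstream
    in-frame start codon is reported.
    """
    seq = mrna.upper().replace("U", "T")
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in start_codons:
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return sorted(orfs)

# Toy transcript, made up for illustration.
toy_mrna = "CC" + "ATG" + "GCT" * 4 + "TAA" + "G"
print(find_orfs(toy_mrna, min_codons=3))
```

Passing additional codons in `start_codons` (for example CTG or GTG) would extend the scan to non-AUG initiation, at the cost of many more candidate ORFs.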
[Figure: Posterior Error Probability (PEP) (y-axis, 0.00 to 0.06) for Serum aORF, Serum rORF, HeLa aORF, HeLa rORF, HeLa < 10 kDa aORF and HeLa < 10 kDa rORF; significance marked with asterisks]
![Page 15: Getting genomics and proteomics data to work together - Jason Wong](https://reader033.fdocuments.in/reader033/viewer/2022042816/559b47ca1a28abaf2e8b4671/html5/thumbnails/15.jpg)
Ribosome profiling
Ingolia et al. Cell 2011
• Sequence only mRNA protected by ribosomes
• Results in identification of mRNA that is being translated
• Use of translation inhibitors enables discovery of new ORFs
![Page 16: Getting genomics and proteomics data to work together - Jason Wong](https://reader033.fdocuments.in/reader033/viewer/2022042816/559b47ca1a28abaf2e8b4671/html5/thumbnails/16.jpg)
Application to HEK293T cells
Lee et al PNAS 2012
o Reported 12,814 alternative ORFs in RefSeq genes
o Using their annotation, constructed a database of proteins arising from these alternative ORFs.
o Searched against publicly available HEK293T datasets
• Geiger et al MCP 2012, ~0.5 million spectra
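Building a search database from annotated alternative ORFs comes down to translating each ORF and writing it out as a FASTA entry. The sketch below shows one way this might be done. Two assumptions are made and should be checked against the actual pipeline: the first codon is always translated as methionine, including for non-AUG start codons, and the FASTA header follows the accession_gene_offset_annotation pattern seen in the identified ORF names. All function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate_orf(orf_nt):
    """Translate an ORF that begins at its start codon.

    Assumption: the initiator codon is read as M even when it is not ATG.
    Translation stops at the first in-frame stop codon.
    """
    seq = orf_nt.upper()
    protein = ["M"]
    for i in range(3, len(seq) - 2, 3):
        amino_acid = CODON_TABLE[seq[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

def fasta_entry(accession, gene, offset, annotation, orf_nt, width=60):
    """Format one alternative ORF as a FASTA record (header style assumed)."""
    header = f">{accession}_{gene}_{offset}_{annotation}"
    protein = translate_orf(orf_nt)
    lines = [protein[i:i + width] for i in range(0, len(protein), width)]
    return "\n".join([header] + lines)

# Placeholder accession and sequence, for illustration only.
print(fasta_entry("NM_000000", "GENE1", 10, "5'UTR", "GTGGCTAAATAA"))
```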
![Page 17: Getting genomics and proteomics data to work together - Jason Wong](https://reader033.fdocuments.in/reader033/viewer/2022042816/559b47ca1a28abaf2e8b4671/html5/thumbnails/17.jpg)
Identified novel proteins

| RefSeq Accession | Gene Symbol | Relative to rTIS | Annotation | Frame | ORF length | Codon | Peptide count |
| --- | --- | --- | --- | --- | --- | --- | --- |
| NM_019008 | SMCR7L | -311 | 5'UTR | 1 | 213 | ATG | 3 |
| NM_080670 | SLC35A4 | -719 | 5'UTR | 1 | 312 | ATG | 2 |
| NM_001142726 | C1orf122 | -609 | 5'UTR | 0 | 753 | TTG | 2 |
| NM_004860 | FXR2 | -219 | 5'UTR | 0 | 2241 | GTG | 2 |

(11 novel ORFs in total)
NM_080670_SLC35A4_10_5'UTR (chr5:139,944,429-139,946,345)
NM_019008_SMCR7L_188_5'UTR (chr22:39,900,236-39,900,445)
[Figure: percentage of regions (y-axis, 0.0 to 0.4) against conservation (x-axis, 0.0 to 1.0) for 5'UTR and Coding regions]
![Page 18: Getting genomics and proteomics data to work together - Jason Wong](https://reader033.fdocuments.in/reader033/viewer/2022042816/559b47ca1a28abaf2e8b4671/html5/thumbnails/18.jpg)
Conclusion and next steps
Variants
• Analyse indels and more datasets
Novel proteins
• Develop methods to analyse ribosome profiles in unannotated genomic regions.
• Generate HEK293T peptidomics (< 10 kDa) dataset
In terms of bioinformatics:
• Automate integration of transcriptomics and proteomics data.
• Methods to visualise identified peptides on genome browsers.
• Develop new MS-based search methods for directly finding variant peptides.
![Page 19: Getting genomics and proteomics data to work together - Jason Wong](https://reader033.fdocuments.in/reader033/viewer/2022042816/559b47ca1a28abaf2e8b4671/html5/thumbnails/19.jpg)
Acknowledgements
Bioinformatics and Integrative genomics team
• Dr Ranjeeta Menon
• Dr Dominik Beck
• John Ng
• Jackie Huang
• Kate Guan
• Felix Ma
• Dilmi Perera
• Diego Chacon
Carnegie Institution for Science, Baltimore, USA
• Dr Nicolas Ingolia
Funding: