Gene Set Enrichment and Splicing
Detection using Spectral Counting
Nathan Edwards
Department of Biochemistry and Mol. & Cell. Biology
Georgetown University Medical Center
Outline
• Systems Biology
  • Gene Sets & Functional Enrichment
  • Balls in Urns
• Proteomics
  • MS/MS and Peptide ID
  • Quantitation and Spectrum Counting
• Differential Protein Abundance
• Detecting Splicing and Isoforms
Systems Biology
• High-Throughput Experiments (molecular biology ↕ phenotype)
  • Sequencing
  • Microarrays
  • Proteomics
  • Metabolomics
• Knowledge Databases (molecular biology ↕ biology)
  • UniProt
  • OMIM
  • KEGG
• Mathematical Models (phenotype ↕ biology)
  • Software
  • Statistics
  • Algorithms
Gene Expression Analysis
• Differential expression via:
  • Structured experiments
  • Transcript measurements
  • Statistics
• But now what?
Gene Expression Analysis
Hengel et al. J Immunol. 2003.
• Structured experiment:
  • CD4+/L-selectin- T-cells, vs.
  • CD4+/L-selectin+ T-cells
• Affymetrix Human Genome U95A Array
• Processing & Statistics
  • MAS 4.0, t-tests, FDR filtering, …
• 164 probe identifiers for upregulated genes.
Gene Expression Analysis
34529_AT 38816_AT 679_AT 37105_AT 34623_AT 36378_AT 35648_AT 33979_AT 34529_AT 1372_AT 38646_S_AT 35896_AT 34249_AT 40317_AT 32413_AT 33530_AT 32469_AT 34720_AT 36317_AT 31987_AT 33027_AT 35439_AT 36421_AT 966_AT 967_G_AT 31525_S_AT 38236_AT 34618_AT 34546_AT 31512_AT 40959_AT 38604_AT 33922_AT 40790_AT 35595_AT 33963_AT 33685_AT 35566_F_AT 33684_AT 36436_AT 37166_AT 34453_AT 1645_AT 39469_S_AT 38229_AT 38945_AT 37711_AT 39908_AT 1355_G_AT 38948_AT 1786_AT 39198_S_AT 606_AT 35091_AT 35090_G_AT 37954_AT 822_S_AT 36766_AT 37953_S_AT 38128_AT 40350_AT 37097_AT 33516_AT 38691_S_AT 34702_F_AT 31715_AT 1331_S_AT 34577_AT 33027_AT 38508_S_AT 32680_AT 39187_AT 31506_S_AT 31793_AT 40294_AT 40553_AT 1983_AT 32250_AT 37968_AT 33293_AT 40271_AT 32418_AT 33077_AT 38201_AT 2090_I_AT 34012_AT 34703_F_AT 38482_AT 40058_S_AT 34902_AT 34636_AT 41113_AT 35996_AT 40735_AT 34539_AT 41280_R_AT 37061_AT 34233_I_AT 41703_R_AT 37898_R_AT 35373_AT 37408_AT 35213_AT 31576_AT 39094_AT 32010_AT 919_AT 1855_AT 1391_S_AT 34436_AT 33371_S
Gene Expression Analysis
1112_g_at neural cell adhesion molecule 1
1331_s_at tumor necrosis factor receptor superfamily, member 25
1355_g_at neurotrophic tyrosine kinase, receptor, type 2
1372_at tumor necrosis factor, alpha-induced protein 6
1391_s_at cytochrome P450, family 4, subfamily A, polypeptide 11
1403_s_at chemokine (C-C motif) ligand 5
1419_g_at nitric oxide synthase 2, inducible
1575_at ATP-binding cassette, sub-family B (MDR/TAP), member 1
1645_at KiSS-1 metastasis-suppressor
1786_at c-mer proto-oncogene tyrosine kinase
1855_at fibroblast growth factor 3 (murine mammary tumor virus integration site (v-int-2) oncogene homolog)
1890_at growth differentiation factor 15
… …
Gene Set Enrichment
• Candidate genes are "special" with respect to the experiment structure (phenotype)
• Are they special with respect to general biological knowledge?
  • Are the candidate genes related?
  • Can we filter out the noise?
  • Can we expose associated genes?
• What genes' changes are linked to the experimental structure / phenotype?
Gene Sets
• Genes may be related in many ways:
  • Same pathway, similar function, cellular location
  • Cytoband, identified in previous study, etc.
• Define gene sets for relatedness
  • GO Biological Process
  • GO Molecular Function
  • GO Cellular Component
  • KEGG Pathway, Biocarta Pathway
  • Biological knowledge databases
Drawing Balls from Urns
• 1000 balls: 900 red, 100 blue.
• 100 balls drawn at random: how many red? How many blue?
• How surprising is 5, 10, 15, 20, … blue?
• How surprising is 30, 50, 70, … blue?
• 6 of 155 upregulated genes have "oxygen binding" GO annotation!
• All human genes ( = 25), blue is oxygen binding.
How surprised should we be?
• Classic problem in probability theory
• How well do the observed counts match the expected counts?
• Various mostly equivalent statistical tests are applied:
  • Fisher exact test
  • Hypergeometric
  • Chi-squared (χ²)
• p-value measures "surprise".
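As a rough illustration of the urn calculation, the sketch below computes a one-sided hypergeometric tail probability for the example urn (1000 balls, 100 blue, 100 drawn without replacement). The function names `hypergeom_pmf` and `surprise` are made up for this example and are not from the talk; only Python's standard library is used.

```python
from fractions import Fraction
from math import comb

def hypergeom_pmf(k, total, blue, drawn):
    """P(exactly k blue) when `drawn` balls are taken without replacement."""
    return Fraction(comb(blue, k) * comb(total - blue, drawn - k),
                    comb(total, drawn))

def surprise(k, total, blue, drawn):
    """One-sided p-value: P(at least k blue balls among those drawn)."""
    upper = min(blue, drawn)
    return float(sum(hypergeom_pmf(i, total, blue, drawn)
                     for i in range(k, upper + 1)))

# The urn from the slides: 1000 balls, 100 blue, 100 drawn at random.
for k in (5, 10, 15, 20, 30):
    print(k, surprise(k, total=1000, blue=100, drawn=100))
```

On average 10 of the 100 drawn balls are blue, so counts near 10 give large p-values and counts far above 10 give very small ones. The same function applies to a gene list once the background size and the number of annotated genes are known.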
Proteomics
• Proteins are the machines that drive much of biology
  • Genes are merely the recipe
• The direct characterization of proteins en masse.
  • What proteins are present?
  • How much of each protein is present?
  • Which proteins change in abundance?
Sample Preparation for Tandem Mass Spectrometry
• Enzymatic digest and fractionation
Single Stage MS
Tandem Mass Spectrometry (MS/MS)
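Peptide Fragmentation
Peptide: S-G-F-L-E-E-D-E-L-K
b ions (m/z): S 88, G 145, F 292, L 405, E 534, E 663, D 778, E 907, L 1020, K 1166
y ions (m/z): 147, 260, 389, 504, 633, 762, 875, 1022, 1080, 1166
[Figure: MS/MS spectrum, % intensity (0 to 100) vs. m/z (250 to 1000), with peaks labeled b3 to b9 and y2 to y9]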
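The b- and y-ion ladders on this slide can be approximated from residue masses. The sketch below is an illustration only: it assumes singly charged fragments and monoisotopic residue masses, reads the peptide SGFLEEDELK off the b-ion labels, and uses helper names (`b_ions`, `y_ions`) that are not from the talk. The computed values agree with the integer labels on the slide to within about 1 Da.

```python
# Monoisotopic residue masses (Da), only for the residues in this peptide.
RESIDUE_MASS = {
    "G": 57.02146, "S": 87.03203, "L": 113.08406, "D": 115.02694,
    "K": 128.09496, "E": 129.04259, "F": 147.06841,
}
PROTON = 1.00728   # charge carrier for a singly charged ion
WATER = 18.01056   # y ions keep the C-terminal OH plus an extra H

def b_ions(peptide):
    """Singly charged b-ion m/z values for N-terminal prefixes b1..b(n-1)."""
    masses, total = [], PROTON
    for residue in peptide[:-1]:
        total += RESIDUE_MASS[residue]
        masses.append(total)
    return masses

def y_ions(peptide):
    """Singly charged y-ion m/z values for C-terminal suffixes y1..y(n-1)."""
    masses, total = [], PROTON + WATER
    for residue in reversed(peptide[1:]):
        total += RESIDUE_MASS[residue]
        masses.append(total)
    return masses

peptide = "SGFLEEDELK"
for i, (b, y) in enumerate(zip(b_ions(peptide), y_ions(peptide)), start=1):
    print(f"b{i} = {b:8.2f}    y{i} = {y:8.2f}")
```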
LC-MS/MS
• Powerful combination of liquid chromatography (LC) and tandem mass spectrometry (MS/MS)
• Automatically collect 100k MS/MS spectra in an afternoon
  • Tens of thousands of peptide-spectrum assignments
  • Thousands of proteins identified
Spectral Counting
• Abundant proteins are more likely to be identified:
  • Selection (by the instrument) for fragmentation is based on intensity
  • More abundant ions are more likely to fragment in an informative manner
• A protein's peptide identification count (spectra) can be used as a crude abundance measurement.
  • Easy, cheap, (relative) protein quantitation
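In code, spectral counting amounts to tallying identified spectra per protein. The sketch below assumes a list of peptide-spectrum matches as (spectrum id, peptide, protein) tuples; the identifiers and sequences are invented for illustration and do not come from any dataset in the talk.

```python
from collections import Counter

# Hypothetical peptide-spectrum matches: (spectrum id, peptide, protein).
psms = [
    ("scan_0001", "LEEDELK", "protein_A"),
    ("scan_0002", "LEEDELK", "protein_A"),
    ("scan_0003", "SGFLEEDELK", "protein_A"),
    ("scan_0004", "AVDEK", "protein_B"),
]

def spectral_counts(psms):
    """Count identified spectra per protein (a crude abundance proxy)."""
    return Counter(protein for _, _, protein in psms)

counts = spectral_counts(psms)
for protein, n in counts.most_common():
    print(protein, n)
```

Repeated identifications of the same peptide each add to the count, which is what makes the tally track abundance.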
Differential Spectral Counts
• Spectral counts are too crude for classical (microarray) statistics.
  • Fold change, t-tests, …
• However, we expect "similar" spectral counts when the protein abundance is unchanged.
  • Recast as drawing balls from urns.
HER2/Neu Mouse Model of Breast Cancer
• Paulovich, et al. JPR, 2007
• Study of normal and tumor mammary tissue by LC-MS/MS
  • 1.4 million MS/MS spectra
• Peptide-spectrum assignments
  • Normal samples (Nn): 161,286 (49.7%)
  • Tumor samples (Nt): 163,068 (50.3%)
• 4270 proteins identified in total
Drawing Balls from Urns
All Normal Spectra, All Tumor Spectra
Plastin-2 (Lcp1) 827 102 2.437E-123
Osteopontin (Spp1) 334 19 2.444E-62
Hypoxia up-regulated protein 1 (Hyou1) 200 7 1.437E-40
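One way to sketch the urn view of differential spectral counts: treat every peptide-spectrum assignment as a ball, colored by whether it belongs to the protein of interest, and ask how surprising the protein's split between the two samples is. The code below is an assumption-laden illustration, not the calculation behind the p-values above (the talk does not state the exact test). The function name `urn_pvalue` and the protein counts in the usage lines are hypothetical; only the two sample totals are taken from the HER2/Neu study slide.

```python
from fractions import Fraction
from math import comb

def urn_pvalue(count_a, count_b, total_a, total_b):
    """P(at least count_a of a protein's spectra fall in sample A) if its
    spectra were spread over both samples in proportion to sample size."""
    drawn = count_a + count_b            # all spectra for this protein
    population = total_a + total_b       # all spectra in both samples
    tail = sum(comb(total_a, k) * comb(total_b, drawn - k)
               for k in range(count_a, drawn + 1))
    return float(Fraction(tail, comb(population, drawn)))

NORMAL_TOTAL = 161_286   # Nn from the study slide
TUMOR_TOTAL = 163_068    # Nt from the study slide

# Hypothetical protein: 60 tumor spectra vs. 20 normal spectra.
print(urn_pvalue(60, 20, TUMOR_TOTAL, NORMAL_TOTAL))
# Hypothetical protein with an even split: not surprising.
print(urn_pvalue(40, 40, TUMOR_TOTAL, NORMAL_TOTAL))
```

Exact integer arithmetic via `Fraction` avoids floating-point underflow in the intermediate binomial coefficients, at the cost of speed.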
Functional Enrichment
• 374 proteins with "significantly" increased abundance in tumor tissue
  • Use 4270 proteins as background!
• DAVID gene set enrichment:
  • Protein translation
  • RNA binding, splicing
Differential Spectral Counting
• Assumptions of the formal tests (Fisher exact, χ²) are violated, so
  • p-values can be misleading (too small)
  • Use label permutation tests to compute empirical p-values. SLOW!
• Collapse spectral counts to protein sets (GO terms) directly:
  • Potential to observe more subtle spectral count differences
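A label permutation test can be sketched as follows. This is a minimal illustration under stated assumptions: per-sample spectral counts for one protein, a "tumor"/"normal" label per sample, and the difference of mean counts as the test statistic. The data, the statistic, and the name `permutation_pvalue` are all hypothetical choices, not the talk's implementation.

```python
import random

def permutation_pvalue(counts, labels, n_perm=10000, seed=0):
    """Empirical one-sided p-value for 'tumor counts exceed normal counts'."""
    rng = random.Random(seed)

    def stat(lbls):
        tumor = [c for c, l in zip(counts, lbls) if l == "tumor"]
        normal = [c for c, l in zip(counts, lbls) if l == "normal"]
        return sum(tumor) / len(tumor) - sum(normal) / len(normal)

    observed = stat(labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)            # reassign labels at random
        if stat(shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return (hits + 1) / (n_perm + 1)

# Hypothetical spectral counts for one protein in 8 samples.
counts = [30, 28, 35, 31, 5, 7, 4, 6]
labels = ["tumor"] * 4 + ["normal"] * 4
print(permutation_pvalue(counts, labels))
```

The loop over permutations, repeated for every protein, is what makes this approach slow compared with a closed-form test.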
Unannotated Splice Isoform
Halobacterium sp. NRC-1
ORF: GdhA1
• K-score E-value vs PepArML @ 10% FDR
• Many peptides inconsistent with annotated translation start site of NP_279651
What if there is no "smoking gun" peptide…
PKM2 in Peptide Atlas
[Figure: experiments vs. peptides]
Nascent polypeptide-associated complex subunit alpha
• Long form is "muscle-specific"
  • Exon 3 is missing from short form
• Peptide identifications provide evidence for long form only
  • 9 peptides are specific to long form
  • 6 peptides are found in both isoforms
• Urn with balls of 15 different colors
  • p-value of observed spectral counts: 7.3E-8
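The multi-color urn can be sketched as a goodness-of-fit question: each peptide is a color, and the observed spectral counts per peptide are compared with the proportions expected if nothing unusual were going on. The talk does not say how the expected proportions were obtained or which test produced the p-value, so everything below is an assumption: a chi-squared statistic, a Monte Carlo estimate of its null distribution (draws with replacement, i.e. multinomial), and made-up counts and proportions.

```python
import random

def chi_square(observed, expected):
    """Sum of (observed - expected)^2 / expected over all colors."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def multicolor_urn_pvalue(observed, proportions, n_sim=5000, seed=0):
    """Empirical p-value that counts drawn from `proportions` look at
    least as skewed as `observed` (chi-squared statistic)."""
    rng = random.Random(seed)
    total = sum(observed)
    expected = [total * p for p in proportions]
    obs_stat = chi_square(observed, expected)
    colors = range(len(proportions))
    hits = 0
    for _ in range(n_sim):
        sim = [0] * len(proportions)
        for color in rng.choices(colors, weights=proportions, k=total):
            sim[color] += 1
        if chi_square(sim, expected) >= obs_stat:
            hits += 1
    return (hits + 1) / (n_sim + 1)

# Hypothetical: 4 peptides expected equally often, one heavily over-counted.
print(multicolor_urn_pvalue([70, 10, 10, 10], [0.25] * 4))
```

With simulation the smallest reportable p-value is about 1/(n_sim + 1), so very small p-values like those on the slides would need an analytic tail or far more simulations.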
Pyruvate kinase isozymes M1/M2
• Exon "substitution" changes sequence in the middle of the protein
• Peptide identifications provide evidence for both isoforms
  • 3 peptides are specific to isoform 1
  • 5 peptides are specific to isoform 2
• Urn with balls of 63 colors for isoform 1
  • p-value of observed spectral counts: 2.46E-05
Summary
• Systems biology requires:
  • Experiments, Databases, Models
  • Informaticians and Disease Experts
• Functional Enrichment:
  • Quickly navigate knowledge databases using experiment-derived genes
  • Classical probability experiment: Balls & Urns
  • How surprised should you be?
  • Still require domain expert to pick out gems
• Proteomics:
  • High-throughput protein comparison
  • Proteome "sample" is identified
  • Crude spectral count quantitation
• Differential protein abundance:
  • Use Balls & Urns to find significant changes
  • Apply functional enrichment tools
• Splicing detection:
  • Perturbed peptide spectral counts provide evidence for splicing.
  • Evaluate using Balls & Urns