Intro to Comp Genomics Lecture 7: Using large scale functional genomics datasets.
Date posted: 21-Dec-2015
Your Task
[Diagram: states B, P1, P2, P3, P..]
Preparations:
• Get your hands on the ChIP-seq profiles of CTCF and PolII in hg chr17, bin-size = 50bp
• Cut the data into segments of 50,000 data points
Modeling:
• Use EM to build a probabilistic model for the peak signals and the background.
• Use heuristics for peak finding to initialize the EM
Analysis:
• Test if your model for single peak structure is as good as the model for two peak structures.
• Compute the distribution of peaks relative to transcription start sites
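Your Task: Modeling
$P(x \mid P_1) = N(x; \mu_1, \sigma_1)$
$P(x \mid P_2) = N(x; \mu_2, \sigma_2)$
$P(x \mid P_3) = N(x; \mu_3, \sigma_3)$
$P(x \mid P_4) = N(x; \mu_4, \sigma_4)$
$P(x \mid B) = N(x; \mu_B, \sigma_B)$
The model uses K states for the peak and one state for the background. Use K=40.
[Diagram labels: S, F]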
Your Task: Modeling
Implement HMM inference: forward-backward
Make sure your total probability is the same in the forward and the backward forms!
Implement the EM update rules
Run EM from multiple random points and record the likelihoods you derive
Implement smarter initialization: take the average values around all probes with value over a threshold.
Compute posterior peak probabilities: report all loci with P(Peak)>0.8
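The forward-backward step and the consistency check between the two passes can be sketched as follows. This is a minimal illustration, not the course's reference solution: it assumes Gaussian emissions, uses a generic transition matrix rather than the K=40 peak-state layout, and the function names (`normal_pdf`, `forward_backward`) and the two-state toy parameters are hypothetical.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma) at x (sigma is a standard deviation)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def forward_backward(obs, init, trans, means, sds):
    """Scaled forward-backward for an HMM with Gaussian emissions.

    Returns (posteriors, loglik_forward, loglik_backward); the two
    log-likelihoods should agree up to rounding error.
    """
    n, k = len(obs), len(init)
    emit = [[normal_pdf(x, means[s], sds[s]) for s in range(k)] for x in obs]

    # Forward pass: each column is rescaled to sum to 1 to avoid underflow.
    alpha, loglik_fwd = [], 0.0
    for t in range(n):
        if t == 0:
            col = [init[s] * emit[0][s] for s in range(k)]
        else:
            col = [emit[t][s] * sum(alpha[t - 1][r] * trans[r][s] for r in range(k))
                   for s in range(k)]
        c = sum(col)
        loglik_fwd += math.log(c)
        alpha.append([v / c for v in col])

    # Backward pass with its own scaling factors.
    beta = [None] * n
    beta[n - 1] = [1.0] * k
    log_bscale = 0.0
    for t in range(n - 2, -1, -1):
        col = [sum(trans[s][r] * emit[t + 1][r] * beta[t + 1][r] for r in range(k))
               for s in range(k)]
        c = sum(col)
        log_bscale += math.log(c)
        beta[t] = [v / c for v in col]
    # Total probability in the backward form: sum_s init[s] * emit[0][s] * beta_0(s).
    loglik_bwd = log_bscale + math.log(
        sum(init[s] * emit[0][s] * beta[0][s] for s in range(k)))

    # Posterior state probabilities at each position.
    post = []
    for t in range(n):
        w = [alpha[t][s] * beta[t][s] for s in range(k)]
        z = sum(w)
        post.append([v / z for v in w])
    return post, loglik_fwd, loglik_bwd

# Toy example (assumed values): state 0 = background, state 1 = peak.
obs = [0.1, -0.3, 0.2, 4.8, 5.3, 5.1, 0.0, -0.2]
post, ll_f, ll_b = forward_backward(
    obs, init=[0.9, 0.1], trans=[[0.9, 0.1], [0.2, 0.8]],
    means=[0.0, 5.0], sds=[1.0, 1.0])
peak_loci = [t for t, p in enumerate(post) if p[1] > 0.8]
```

In this sketch `peak_loci` plays the role of the reported loci with P(Peak)>0.8; for the real task the peak posterior would be summed over all peak states.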
Your Task: Analysis
Compare the two peak structures you get (from CTCF and PolII)
Retrain a single model on the two datasets together
Compute the log-likelihood of the unified model and compare it to the sum of the log-likelihoods of the two separate models
Optional: test if the difference is significant by:
- sampling data from the unified model
- training two models on the synthetic data and computing the likelihood delta as for the real data
Use a set of known TSSs to compute the distribution of peaks relative to genes
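One possible way to compute the distribution of peaks relative to TSSs is sketched below. It assumes peaks and TSSs are given as chr17 coordinates in bp and that each TSS has a strand; the function names, the 50bp histogram bin and the 1kb cutoff are illustrative choices, not part of the assignment.

```python
import bisect

def nearest_tss_offset(peak, tss_positions, tss_strands):
    """Signed distance (bp) from a peak to its nearest TSS.

    tss_positions must be sorted; negative values mean the peak lies
    upstream of the TSS with respect to the gene's strand.
    """
    i = bisect.bisect_left(tss_positions, peak)
    candidates = [j for j in (i - 1, i) if 0 <= j < len(tss_positions)]
    j = min(candidates, key=lambda c: abs(peak - tss_positions[c]))
    offset = peak - tss_positions[j]
    return offset if tss_strands[j] == '+' else -offset

def offset_histogram(peaks, tss_positions, tss_strands, bin_size=50, max_dist=1000):
    """Count peaks per offset bin; a bin is labeled by its lower edge."""
    counts = {}
    for p in peaks:
        d = nearest_tss_offset(p, tss_positions, tss_strands)
        if abs(d) <= max_dist:
            b = (d // bin_size) * bin_size
            counts[b] = counts.get(b, 0) + 1
    return dict(sorted(counts.items()))

# Toy coordinates (made up for illustration only).
tss_positions = [1000, 5000]
tss_strands = ['+', '-']
hist = offset_histogram([900, 5200, 4900, 1020, 3000], tss_positions, tss_strands)
```

Flipping the sign for minus-strand genes keeps "upstream" on the negative side for every gene, so promoters on both strands pile up in the same bins.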
Functional genomics
• 10 years after the appearance of microarrays, thousands of experiments were performed on different cells and conditions
• One of the original promises of the technology is that it will form a vast body of data that can serve future modeling and analysis purposes
• Standards have been established, and it is mandatory to deposit high-throughput datasets when publishing papers describing them
• Unlike PubMed for literature or BLAST/BLAT for sequence, the functional genomics databases cannot be searched with a single simple tool
• We will discuss and practice some strategies for utilizing this powerful resource
NCBI - GEO
Record types: Platform, Sample, Series
Data availability
GEO: 268,611 experiments (!!)
5343 platforms
(Any species, condition, experiment)
Mandatory submission for all published papers
Also: EBI ArrayExpress
Challenge: find what you need
Specific databases are curated and organized:
Species: e.g., SGD for yeast
Disease: e.g., Oncomine for cancer – 28,800 arrays organized around specific cancer types
Gene expression:
Different sets of genes or gene models!
Still most of the data
Conditions are critical
Comparative genomic hybridization (aCGH):
Important for diseases with genomic aberrations
TF binding profiles
Old type: gene arrays
Currently: tiling arrays or ChIP-seq
Phenotype?
Other specific assays?
Gene expression data comes from different platforms (old cDNA, Affy, new long oligo arrays)
Vastly different gene sets and gene models
RNA genes are now on most arrays
Understanding the experimental conditions for each array is a challenge
Avoiding replicates or using them smartly
Beware of systematic pre-normalization of the original data: subtracting the median/mean from a specific dataset introduces a strong bias for all the arrays in it when compared to other datasets!
Transcription factor interactions, histone modifications maps:
Genes bound by certain TFs
Genes (or regions) enriched for specific histone modifications
Hundreds of factors and modifications
Different experimental conditions
Abundant data for yeast, flies, mouse and human
Histone modifications
Knock-down/knock-out library phenotype
Library of mutants lacking each of the non-essential yeast genes is available (knockout)
Essential genes can be knocked down using a specialized promoter
Libraries can be automatically screened for viability and/or growth rate in different conditions using robotics and 96/384 well plate formats
Libraries of RNAi constructs allow similar screens for worms and flies.
Mammalian screens are becoming possible as well
Genetic interactions
Testing the phenotype of multi-gene knockouts provides key insights into the genetic network
A gene may be essential for growth under some condition, but become dispensable when another gene is knocked down
A mutation can be lethal only in the presence of another knockout (synthetic lethality)
In yeast, systematic screens for synthetic lethality have been practical for over 5 years.
Genetic interactions
Improved technology provides more quantitative measurements of the growth phenotype of double knock-downs
Matching all pairs of genes in a large subset of the genome is practical, and the resulting EMAP provides a quantitative estimate of the epistasis in the group (e.g., Schuldiner lab here at WIS)
$f(AB) \overset{?}{=} f(A) \times f(B)$
Protein interactions
Physical interactions between proteins highlight post-translational regulatory networks and the structural organization of key organelles
Data comes from several technologies:
Most reliable: techniques involving mass spectrometry and isolation of protein complexes.
Indirect techniques involving transcriptional assays (yeast two-hybrid)
And more..
Data is partial and sometimes difficult to interpret (what do we mean by interaction?)
A large body of literature deals with speculation on protein networks; its relevance to actual biology is questionable…
Array CGH/genetic aberrations
Data on deletion/insertion and copy number variation is generated by hybridization to arrays or more recently through sequencing
Data is critical for studies of cancer.
Databases also include lists of genomic loci that are known to be unstable in (specific types of) cancer.
Gene ontology
Hierarchical vocabulary (GO terms)
Unifying different research communities
Process-…
Function-…
Component-…
Annotations: association of term with gene in a specific species
Also associating all super-terms
GO-Slim is a flat version of the ontologies
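Z-scores, T-test – the basics
You want to test if the mean (RNA expression) of a gene set A is significantly different from that of a gene set B.
If you assume the variance of A and B is the same:
$$t = \frac{\bar{X}_A - \bar{X}_B}{\sqrt{\frac{(n_A - 1)S_A^2 + (n_B - 1)S_B^2}{n_A + n_B - 2}\left(\frac{1}{n_A} + \frac{1}{n_B}\right)}}$$
t is distributed like T with $n_A + n_B - 2$ degrees of freedom
If you don’t assume the variance is the same:
$$t = \frac{\bar{X}_A - \bar{X}_B}{\sqrt{\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}}}, \qquad \text{d.o.f.} = \frac{\left(\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}\right)^2}{\frac{(s_A^2/n_A)^2}{n_A - 1} + \frac{(s_B^2/n_B)^2}{n_B - 1}}$$
But in this case the whole test becomes rather flaky!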
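As a small illustration of the two t statistics on this slide, the sketch below computes the equal-variance (pooled) and unequal-variance versions together with their degrees of freedom. The function names are hypothetical, and turning t into a p-value (which needs the T distribution's CDF) is left out.

```python
import math
from statistics import mean, variance  # variance() is the sample variance (n - 1)

def pooled_t(a, b):
    """t statistic and d.o.f. assuming A and B share the same variance."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / math.sqrt(pooled_var * (1.0 / na + 1.0 / nb))
    return t, na + nb - 2

def unequal_variance_t(a, b):
    """t statistic and approximate d.o.f. without the equal-variance assumption."""
    na, nb = len(a), len(b)
    va, vb = variance(a) / na, variance(b) / nb
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    dof = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, dof

# Toy expression values for two gene sets (made up).
set_a = [1.0, 2.0, 3.0, 4.0, 5.0]
set_b = [2.0, 4.0, 6.0, 8.0, 10.0]
t_pooled, dof_pooled = pooled_t(set_a, set_b)
t_unequal, dof_unequal = unequal_variance_t(set_a, set_b)
```

With equal set sizes the two statistics coincide and only the degrees of freedom differ, which is one way to see where the second form loses power.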
In a common scenario, you have a small set of genes, and you screen a large set of conditions for interesting biases.
You need a quick way to quantify deviation of the mean
For a set of K genes, sampled from a standard normal distribution, how would the mean be distributed?
$$\text{The mean} \sim N\left(0, \tfrac{1}{K}\right) \quad \text{(variance } 1/K \text{, i.e. standard deviation } 1/\sqrt{K}\text{)}$$
So if your conditions are normally distributed, and pre-standardized to mean 0, std 1
You can quickly compute the sum of values over your set and generate a z-score
$$Z = \frac{X_A}{\sqrt{|A|}}, \qquad X_A = \sum_{i \in A} x_i$$
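A quick sketch of this screening shortcut: standardize each condition across all genes, then divide the sum over the gene set by the square root of the set size. The helper names and the toy numbers are assumptions for illustration.

```python
import math
from statistics import mean, pstdev

def standardize(values):
    """Rescale one condition (all genes) to mean 0 and standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def set_zscore(standardized, gene_set):
    """Sum of the set's standardized values divided by sqrt(set size)."""
    total = sum(standardized[g] for g in gene_set)
    return total / math.sqrt(len(gene_set))

# One toy condition over 8 genes; the set holds the indices of 3 genes.
condition = [0.2, 1.9, -0.4, 2.3, -1.1, 0.0, 2.1, -0.8]
z = set_zscore(standardize(condition), gene_set=[1, 3, 6])
```

Under the null the result is approximately standard normal, so values of a few units already flag a condition worth a closer look.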
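Kolmogorov-Smirnov statistics
One sample against a known distribution P:
$$D = \max_x \left| S_N(x) - P(x) \right|$$
Two samples:
$$D = \max_x \left| S_{N_1}(x) - S_{N_2}(x) \right|$$
The D statistic is non-parametric: you can apply any monotonic transformation to x (e.g. log x) without changing it
The distribution of the D statistic is given by the form:
$$Q_{KS}(\lambda) = 2 \sum_{j=1}^{\infty} (-1)^{j-1} e^{-2 j^2 \lambda^2}$$
$$P(D > \text{observed}) = Q_{KS}\left( \left[ \sqrt{N_e} + 0.12 + 0.11/\sqrt{N_e} \right] D \right), \qquad N_e = \frac{N_1 N_2}{N_1 + N_2}$$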
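The two-sample D statistic and the approximate significance formula can be sketched in a few lines. The names `ks_two_sample`, `q_ks` and `ks_pvalue` are hypothetical; truncating the series at 100 terms and returning 1.0 for very small arguments (where the series converges poorly) are numerical shortcuts assumed here.

```python
import math

def ks_two_sample(x, y):
    """Maximum distance between the two empirical cumulative distributions."""
    xs, ys = sorted(x), sorted(y)
    n1, n2 = len(xs), len(ys)
    i = j = 0
    d = 0.0
    while i < n1 and j < n2:
        v = min(xs[i], ys[j])
        # Step both CDFs past every value <= v (handles ties).
        while i < n1 and xs[i] <= v:
            i += 1
        while j < n2 and ys[j] <= v:
            j += 1
        d = max(d, abs(i / n1 - j / n2))
    return d

def q_ks(lam, terms=100):
    """Truncated alternating series for Q_KS(lambda), clipped to [0, 1]."""
    if lam < 0.3:
        return 1.0
    total = sum((-1) ** (j - 1) * math.exp(-2.0 * j * j * lam * lam)
                for j in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))

def ks_pvalue(d, n1, n2):
    """Approximate P(D > observed) using the effective sample size N_e."""
    ne = n1 * n2 / (n1 + n2)
    root = math.sqrt(ne)
    return q_ks((root + 0.12 + 0.11 / root) * d)

x = [1.0, 2.0, 5.0, 9.0]
y = [3.0, 4.0, 6.0, 7.0, 8.0]
d_raw = ks_two_sample(x, y)
d_log = ks_two_sample([math.log(v) for v in x], [math.log(v) for v in y])
p = ks_pvalue(d_raw, len(x), len(y))
```

`d_raw` and `d_log` come out identical, which is the invariance under monotonic transformations mentioned on the slide.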
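A non-parametric variant on the T-test theme is the Mann-Whitney test.
You take your two sets and rank them together. You sum the ranks of one of your sets ($R_1$)
$$U = R_1 - \frac{n_1 (n_1 + 1)}{2}$$
$$U \sim N(\mu_U, \sigma_U), \qquad \mu_U = \frac{n_1 n_2}{2}, \qquad \sigma_U = \sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}}$$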
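A minimal sketch of the rank-sum computation and its normal approximation follows. Giving tied values their average rank is a common convention assumed here (the slide does not discuss ties), and the function name is hypothetical.

```python
import math

def mann_whitney(a, b):
    """Return (U, z) for set a versus set b using the normal approximation."""
    combined = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        # Extend j over a run of tied values and give them the average rank.
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    r1 = sum(r for r, (_, group) in zip(ranks, combined) if group == 0)
    n1, n2 = len(a), len(b)
    u = r1 - n1 * (n1 + 1) / 2.0
    mu_u = n1 * n2 / 2.0
    sigma_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    return u, (u - mu_u) / sigma_u

u_low, z_low = mann_whitney([1, 2, 3], [4, 5, 6])
u_high, z_high = mann_whitney([4, 5, 6], [1, 2, 3])
```

The normal approximation is only reasonable for sets that are not tiny; the three-element sets here are just for checking the arithmetic.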
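Hyper-geometric and chi-square test
For two sets A and B of sizes $n_A$ and $n_B$ drawn from N genes:
$$P(|A \cap B| = k) = \frac{\binom{n_A}{k} \binom{N - n_A}{n_B - k}}{\binom{N}{n_B}}$$
Contingency table of A (rows) against B (columns), with row totals, column totals and the grand total N:
$$\begin{array}{ccc|c} n_{11} & n_{12} & n_{13} & n_{1\cdot} \\ n_{21} & n_{22} & n_{23} & n_{2\cdot} \\ n_{31} & n_{32} & n_{33} & n_{3\cdot} \\ \hline n_{\cdot 1} & n_{\cdot 2} & n_{\cdot 3} & N \end{array}$$
$$\chi^2 = \sum_{i,j} \frac{(n_{i,j} - E_{i,j})^2}{E_{i,j}}, \qquad E_{i,j} = \frac{n_{i\cdot}\, n_{\cdot j}}{N}$$
Chi-square distributed with m*n-m-n+1 = (m-1)(n-1) d.o.f. for an m by n table.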
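Both tests on this slide are short enough to sketch with the standard library. The function names are hypothetical; the enrichment p-value is taken here as the upper tail P(overlap >= k), which is a common choice rather than something the slide specifies, and `math.comb` requires Python 3.8 or later.

```python
from math import comb

def hypergeom_pmf(k, n_total, n_a, n_b):
    """P(|A and B| = k) for sets of sizes n_a and n_b drawn from n_total genes."""
    return comb(n_a, k) * comb(n_total - n_a, n_b - k) / comb(n_total, n_b)

def hypergeom_enrichment_p(k, n_total, n_a, n_b):
    """Upper tail: probability of an overlap of at least k by chance."""
    return sum(hypergeom_pmf(i, n_total, n_a, n_b)
               for i in range(k, min(n_a, n_b) + 1))

def chi_square(table):
    """Chi-square statistic and d.o.f. for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

p_overlap = hypergeom_enrichment_p(2, n_total=10, n_a=4, n_b=3)
stat, dof = chi_square([[10, 20], [20, 10]])
```

For genome-scale set sizes the binomial coefficients become very large integers; Python handles them exactly, but a log-space version would be faster.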
Testing hypotheses on interaction graphs
Given your gene set and a set of gene-gene or protein-protein interactions.
How can you test if your set is enriched in intra- interactions?
Criterion for an additional gene that is strongly interacting with your set?
Node’s degree in the graph?
Overall network density?
Do complexes tend to be split by your set, or do they tend to be contained in it?
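The iterative signature algorithm
[Figure: expression matrix normalized for conditions. For a gene set $A_G$, the set averages $e_{A,1}, \dots, e_{A,j}, \dots, e_{A,m}$ are computed for each of the m conditions; for a condition set $A_C$, the averages $e_{1,A}, \dots, e_{j,A}, \dots, e_{n,A}$ are computed for each of the n genes.]
Start from a gene set $A_{G,0}$ of k genes and select conditions:
Simple statistics: $A_{C,1} = \{\, j \mid e_{A_G,j} > T/\sqrt{k} \,\}$
Plug in your favorite: $A_{C,1} = \{\, j \mid \mathrm{pval}(j) < \mathit{thres} \,\}$
Then, from the condition set $A_{C,1}$, select genes:
Simple statistics: $A_{G,1} = \{\, i \mid e_{i,A_C} > T/\sqrt{|A_{C,1}|} \,\}$
Plug in your favorite: $A_{G,1} = \{\, i \mid \mathrm{pval}(i) < \mathit{thres} \,\}$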
The iterative signature algorithm
[Figure: the condition-selection and gene-selection steps, alternating between the gene set $A_{G,iter}$ and the condition set $A_{C,iter}$.]
Iterate until convergence (small changes in gene/condition sets)
Convergence is not guaranteed.
Try starting from your target gene set or from random sets.
Thresholds are critical
Variants: use a weighted average instead of a plain average
Allow signs for conditions
Different statistics for thresholding (non-parametric KS/MW? Parametric non-normal?)
Can you think of a probabilistic version?
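A deterministic version of the iteration might look like the sketch below. It is one reading of the slides, not the published algorithm's exact form: set averages are thresholded at T divided by the square root of the set size, a single matrix is used for both steps, and the function name, default thresholds and toy matrix are all assumptions.

```python
import math

def iterative_signature(matrix, start_genes, t_cond=2.0, t_gene=2.0, max_iter=50):
    """Alternate between selecting conditions and genes until the sets stop changing.

    matrix[i][j] is the (ideally standardized) value of gene i in condition j.
    Returns (genes, conditions) as sets of row and column indices.
    """
    n_genes, n_conds = len(matrix), len(matrix[0])
    genes, conds = set(start_genes), set()
    for _ in range(max_iter):
        if not genes:
            break
        k = len(genes)
        # Conditions in which the gene-set average exceeds the threshold.
        new_conds = {j for j in range(n_conds)
                     if sum(matrix[i][j] for i in genes) / k > t_cond / math.sqrt(k)}
        if not new_conds:
            return set(), set()
        c = len(new_conds)
        # Genes whose average over the selected conditions exceeds the threshold.
        new_genes = {i for i in range(n_genes)
                     if sum(matrix[i][j] for j in new_conds) / c > t_gene / math.sqrt(c)}
        if new_genes == genes and new_conds == conds:
            break
        genes, conds = new_genes, new_conds
    return genes, conds

# Toy matrix (not truly standardized): 40 genes x 20 conditions with small
# deterministic "noise" and a planted bicluster in genes 0-9, conditions 0-4.
toy = [[((7 * i + 3 * j) % 5 - 2) * 0.1 + (3.0 if i < 10 and j < 5 else 0.0)
        for j in range(20)] for i in range(40)]
genes, conds = iterative_signature(toy, start_genes=[0, 1, 30])
```

Starting from two bicluster genes plus one unrelated gene, the iteration drops gene 30 and recovers the planted block; raising or lowering `t_cond` and `t_gene` is the simplest knob for getting smaller or larger sets.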
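A Probabilistic formulation
[Figure: matrix normalized for conditions, with a membership indicator $d_i$ for each gene (e.g. $d_1 = 0$) and $c_j$ for each condition.]
$$\Pr(e_{ij}) = d_i c_j\, N(\mu; \sigma) + (1 - d_i c_j)\, N(0; 1)$$
Updates such as $\Pr(c_j = 1 \mid e, d^0)$ are built from ratios of the form
$$\frac{d_i c_j\, N(\mu; \sigma)}{d_i c_j\, N(\mu; \sigma) + (1 - d_i c_j)\, N(0; 1)}$$
taken over the conditions $j$ when updating $d_i$ and over the genes $i$ when updating $c_j$.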
Pros and cons?
Playing with the condition/gene means?
Convergence?
Multiple-testing
Testing for high mean of your gene set in 100,000 conditions in the database.
You expect to get one case with p<0.00001 !
Stringent correction (Bonferroni): multiply the p-value by the number of tests
A rational alternative: control the false-discovery rate (FDR):
e.g., 10 times more “hits” than expected errors
In many cases, your tests are not really independent
For example, testing enrichment for functional annotations that are hierarchical
Another example is multiple gene expression conditions that are very similar (same tumor type)
You can estimate the empirical distribution of your statistic on random sets of the same size and use this to derive your p-value
This should be done with care: making sure your sampled sets are really similar in nature to your true sets and controlling for effects you want to factor out.
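The random-set approach can be sketched as follows. This minimal version samples gene sets uniformly, which is exactly the naive choice the caveat above warns about; a careful analysis would restrict sampling to genes that resemble the true set. The function name, the add-one correction and the default of 1000 samples are assumptions.

```python
import random

def empirical_pvalue(values, gene_set, n_samples=1000, seed=0):
    """Fraction of random same-size sets whose mean is at least the observed mean.

    values[g] is the score of gene g in one condition. One is added to the
    numerator and denominator so the estimate is never exactly zero.
    """
    rng = random.Random(seed)
    k = len(gene_set)
    observed = sum(values[g] for g in gene_set) / k
    hits = 0
    for _ in range(n_samples):
        sample = rng.sample(range(len(values)), k)
        if sum(values[g] for g in sample) / k >= observed:
            hits += 1
    return (hits + 1) / (n_samples + 1)

# Toy scores for 100 genes; the top five genes form an extreme set.
scores = [float(v) for v in range(100)]
p_high = empirical_pvalue(scores, gene_set=[95, 96, 97, 98, 99])
p_low = empirical_pvalue(scores, gene_set=[0, 1, 2, 3, 4])
```

The smallest p-value this can report is 1/(n_samples + 1), so the number of random sets has to grow with the number of tests you intend to correct for.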
[Figure: P-value cutoff; GO term 1, GO term 2]
Your Task
• Download the GNF human expression atlas from the UCSC genome browser or GEO
• Find 1-5 datasets on breast cancer in GEO
• Combine IDs, merge the datasets
• Download the gene ontology human associations. Extract gene set(s) related to apoptosis and to cell cycle.
• Use your previous analysis of chromosome 17 to generate the set of 40 genes for which the 20k window containing their promoter had the lowest correlation to the overall k-mer spectrum
• Also generate a set of 40 chr17 genes with the highest G+C content in the 1kb upstream of their promoter (you can use the Genome browser tools for that)
• Implement your version of the iterative signature algorithm (you are free to select the statistics you are using). You can implement the deterministic or probabilistic version.
• Starting from the above gene sets, see if and how your algorithm converges. Compute the intersection of the converged set with the original sets and report the conditions you found
• Change your algorithm parameters to get smaller or larger biclusters, plot the size of the resulting sets as a function of the parameter you are changing