A method to recognize universal patterns in genome ...€¦ · Introduction to Hi-C Neil Chowdhury...

31
A method to recognize universal patterns in genome structure using Hi-C Neil Chowdhury Phillips Exeter Academy Mentor: Sameer Abraham Massachusetts Institute of Technology

Transcript of A method to recognize universal patterns in genome ...€¦ · Introduction to Hi-C Neil Chowdhury...

Page 1: A method to recognize universal patterns in genome ...€¦ · Introduction to Hi-C Neil Chowdhury 2 Hi-C Matrix of Rao et al. (2014) GM12878 genome . Chromatin structure in the cell

A method to recognize universal patterns in genome structure using Hi-C

Neil Chowdhury Phillips Exeter Academy

Mentor: Sameer Abraham

Massachusetts Institute of Technology

Page 2: A method to recognize universal patterns in genome ...€¦ · Introduction to Hi-C Neil Chowdhury 2 Hi-C Matrix of Rao et al. (2014) GM12878 genome . Chromatin structure in the cell

Introduction to Hi-C

Neil Chowdhury 2

Hi-C Matrix of Rao et al. (2014) GM12878 genome

Chromatin structure in the cell nucleus

Region 1

Region 2

Hi-C process

• Hi-C: Process of finding contact probabilities between every pair of regions in a DNA strand

• Graphed as a square adjacency matrix (rows/cols are regions, cells are probabilities) • Contact probabilities measures the interaction frequency of two regions

Page 3: A method to recognize universal patterns in genome ...€¦ · Introduction to Hi-C Neil Chowdhury 2 Hi-C Matrix of Rao et al. (2014) GM12878 genome . Chromatin structure in the cell

Patterns in Hi-C

Datasets used: Schwarzer et al. (2017) NIPBL mutant, Rao et al. (2014) GM12878 Neil Chowdhury 3

Large scale Cis interactions

chr2

chr3

chr2 chr3

Page 4: A method to recognize universal patterns in genome ...€¦ · Introduction to Hi-C Neil Chowdhury 2 Hi-C Matrix of Rao et al. (2014) GM12878 genome . Chromatin structure in the cell

Compartmentalization

Neil Chowdhury 4

cis trans

• Checkerboarding pattern • Seen in both cis and

trans interactions • Pattern captured by sign

of eigenvector • Clear that there are

subcompartments within A/B

eigenvector

A

B

B

A A

A B

B

Page 5: A method to recognize universal patterns in genome ...€¦ · Introduction to Hi-C Neil Chowdhury 2 Hi-C Matrix of Rao et al. (2014) GM12878 genome . Chromatin structure in the cell

Methodology and Challenges

• Methodology: Use existing clustering techniques from data science to find subcompartments in Hi-C

• Challenges:

• Process Hi-C data into a form amenable to existing clustering algorithms • Assess the quality of clusters • Assess the number of clusters

Neil Chowdhury 5

Page 6: A method to recognize universal patterns in genome ...€¦ · Introduction to Hi-C Neil Chowdhury 2 Hi-C Matrix of Rao et al. (2014) GM12878 genome . Chromatin structure in the cell

What is clustering?

Neil Chowdhury 6

(MathWorks)

• Clusters are tightly packed groups of points in space

• Clustering is the process of

algorithmically finding these groups of points

Page 7: A method to recognize universal patterns in genome ...€¦ · Introduction to Hi-C Neil Chowdhury 2 Hi-C Matrix of Rao et al. (2014) GM12878 genome . Chromatin structure in the cell

Clustering algorithms

• K-means • 𝑘𝑘 centroids • each point is in the cluster with the nearest centroid (means) • minimize variance (squared Euclidean distance) within each

cluster • Agglomerative

• start with each point in its own cluster • repeatedly merge a pair of clusters by some linkage criterion

(single, ward, average) until 𝑘𝑘 clusters reached • Spectral

• create an affinity matrix, compute eigenvectors, use k-means to cluster eigenvectors

Neil Chowdhury 7

(images from Wikipedia)

K-means algorithm

Agglomerative algorithm

(finding 𝑘𝑘 clusters in a set of 𝑛𝑛 points in space)

Example of Clusters (Wikipedia)

Page 8: A method to recognize universal patterns in genome ...€¦ · Introduction to Hi-C Neil Chowdhury 2 Hi-C Matrix of Rao et al. (2014) GM12878 genome . Chromatin structure in the cell

Creating a matrix for clustering

“odd-even matrix”

Even chromosomes

Odd

chr

omos

omes

chr2

chr1 trans

chr3

chr4 …

Neil Chowdhury 8

• Problem with cis interactions: scaling of contact probability and TADs interfere with compartmentalization

• Construct a matrix with all odd chromosomes vs. all even chromosomes (all trans interactions). Technique used in Rao et al. (2014).

• Clustering rows: treat rows as points in

thousand-dimensional space and columns as dimensions

• Clustering columns: treat columns as

points and rows as dimensions

Page 9: A method to recognize universal patterns in genome ...€¦ · Introduction to Hi-C Neil Chowdhury 2 Hi-C Matrix of Rao et al. (2014) GM12878 genome . Chromatin structure in the cell

Visualization of cluster labels Column cluster labels (sorted)

Row

clu

ster

labe

ls (s

orte

d)

Neil Chowdhury 9

1. Run clustering procedure on rows and columns

2. Sort matrix by row and column according to cluster label

3. Plot matrix 4. Add clustering labels

to the top and left

Page 10: A method to recognize universal patterns in genome ...€¦ · Introduction to Hi-C Neil Chowdhury 2 Hi-C Matrix of Rao et al. (2014) GM12878 genome . Chromatin structure in the cell

Eigendecomposition vs. clustering (k=2)

Agglomerative (average linkage) K-means

Spectral

Neil Chowdhury 10

Labels obtained from clustering

Labels obtained from eigenvector sign

• Sort by cluster label, but show eigenvector sign on bottom and right

(A: +/red, B: -/blue)

• Each clustering method is able to find the two compartments

Page 11: A method to recognize universal patterns in genome ...€¦ · Introduction to Hi-C Neil Chowdhury 2 Hi-C Matrix of Rao et al. (2014) GM12878 genome . Chromatin structure in the cell

Stability original

perturbed

clusters clusters

Rand index (ARI)

Neil Chowdhury 11

• 5% gaussian noise • 75% subsample

Stability metric from Luxburg (2010).

(specify 𝑘𝑘)

Page 12: A method to recognize universal patterns in genome ...€¦ · Introduction to Hi-C Neil Chowdhury 2 Hi-C Matrix of Rao et al. (2014) GM12878 genome . Chromatin structure in the cell

K-means

Neil Chowdhury 12

k=2 k=3

k=4 k=5

k=6 k=7

• k<=4 is most stable

• k=6, k=7 clusters look very similar

noise

subsampling

Page 13: A method to recognize universal patterns in genome ...€¦ · Introduction to Hi-C Neil Chowdhury 2 Hi-C Matrix of Rao et al. (2014) GM12878 genome . Chromatin structure in the cell

Agglomerative • Linkage measures the distance between clusters • Average linkage most stable • Stability increases as 𝑘𝑘 gets larger

Neil Chowdhury 13 subsampling noise

Page 14: A method to recognize universal patterns in genome ...€¦ · Introduction to Hi-C Neil Chowdhury 2 Hi-C Matrix of Rao et al. (2014) GM12878 genome . Chromatin structure in the cell

Agglomerative (average linkage)

Neil Chowdhury 14

k=30 k=40

k=5 k=10 k=20

• k >= 20: four clusters found, similar to K-means with k=4

Page 15: A method to recognize universal patterns in genome ...€¦ · Introduction to Hi-C Neil Chowdhury 2 Hi-C Matrix of Rao et al. (2014) GM12878 genome . Chromatin structure in the cell

Spectral

Neil Chowdhury 15

k=6

k=2

k=7

k=3

k=5 k=4 • K=2 to 7 most

stable • A/B

compartments not replicated

subsampling

noise

Page 16: A method to recognize universal patterns in genome ...€¦ · Introduction to Hi-C Neil Chowdhury 2 Hi-C Matrix of Rao et al. (2014) GM12878 genome . Chromatin structure in the cell

Dimensionality reduction

subsampling

noise Neil Chowdhury 16

• Dimensionality reduction using principal component analysis (PCA) improves stability of K-means

• Parallel coordinates plot: 4 clusters have distinct positions in 25-dimensional space (each zig-zag line represents a point)

Page 17: A method to recognize universal patterns in genome ...€¦ · Introduction to Hi-C Neil Chowdhury 2 Hi-C Matrix of Rao et al. (2014) GM12878 genome . Chromatin structure in the cell

K-means (4)

Negative (B) Positive (A)

Neil Chowdhury 17

• Supported by stability analysis • Replicates the results of

agglomerative clustering

Eigenvector:

Page 18: A method to recognize universal patterns in genome ...€¦ · Introduction to Hi-C Neil Chowdhury 2 Hi-C Matrix of Rao et al. (2014) GM12878 genome . Chromatin structure in the cell

ChIP-seq Hi

ston

e m

odifi

er

Stag

e in

cel

l cyc

le

Repli-seq

Neil Chowdhury 18

• A2, B1, B2 deficient in H4k20me1 • B regions generally deficient in all

proteins

Histones (Merriam-Webster)

• Replication order: A1, A2, B1 = B2

Page 19: A method to recognize universal patterns in genome ...€¦ · Introduction to Hi-C Neil Chowdhury 2 Hi-C Matrix of Rao et al. (2014) GM12878 genome . Chromatin structure in the cell

ChromHMM states

Neil Chowdhury 19

• ChromHMM (hidden Markov model) running on ChIP-seq classifies DNA into 15 states (promotors, enhancers, quiescent, transcription start sites, etc.)

• B1 mainly made up of Quies state (light gray)

ChromHMM states

Page 20: A method to recognize universal patterns in genome ...€¦ · Introduction to Hi-C Neil Chowdhury 2 Hi-C Matrix of Rao et al. (2014) GM12878 genome . Chromatin structure in the cell

Conclusion

• Demonstrated that there are four nuclear subcompartments with distinctive features

• Framework for clustering Hi-C data

Neil Chowdhury

Hi-C

eigenvector

stability

visual inspection

clusters

ChIP-seq

ChromHMM

Repli-seq

# of clusters

odd-even matrix

20

Page 21: A method to recognize universal patterns in genome ...€¦ · Introduction to Hi-C Neil Chowdhury 2 Hi-C Matrix of Rao et al. (2014) GM12878 genome . Chromatin structure in the cell

Future Work

• Compare to physics models of chromatin • Write software pipeline to find the compartments (A1, A2, B1, B2)

given any Hi-C matrix • More sophisticated graph-based clustering techniques (simulated

annealing) • More fine-grained analysis with resolution higher than 1mb

Neil Chowdhury 21

Page 22: A method to recognize universal patterns in genome ...€¦ · Introduction to Hi-C Neil Chowdhury 2 Hi-C Matrix of Rao et al. (2014) GM12878 genome . Chromatin structure in the cell

Acknowledgements

• ENCODE project • Hi-C matrices, ChIP-seq, Repli-seq, ChromHMM

• Sameer Abraham for mentoring me • Martin Falk and Prof. Leonid Mirny for inviting me to MIT MirnyLab

and providing me with computational resources • Dr. Slava Gerovitch and Prof. Srini Devadas for inviting me to MIT

PRIMES-CS

Neil Chowdhury 22

Page 23: A method to recognize universal patterns in genome ...€¦ · Introduction to Hi-C Neil Chowdhury 2 Hi-C Matrix of Rao et al. (2014) GM12878 genome . Chromatin structure in the cell

Neil Chowdhury 23

Page 24: A method to recognize universal patterns in genome ...€¦ · Introduction to Hi-C Neil Chowdhury 2 Hi-C Matrix of Rao et al. (2014) GM12878 genome . Chromatin structure in the cell

ChIP-seq/Repli-seq enrichments compared to Rao et al. clusters

Neil Chowdhury 24

Rao et al. (2014)

Our subcompartments

Page 25: A method to recognize universal patterns in genome ...€¦ · Introduction to Hi-C Neil Chowdhury 2 Hi-C Matrix of Rao et al. (2014) GM12878 genome . Chromatin structure in the cell

Distribution of logged trans contact probabilities

Neil Chowdhury 25

Page 26: A method to recognize universal patterns in genome ...€¦ · Introduction to Hi-C Neil Chowdhury 2 Hi-C Matrix of Rao et al. (2014) GM12878 genome . Chromatin structure in the cell

Adjusted Rand Index (ARI)

• X and Y are two clusterings of the same set. Xi and Yi each represent a cluster.

• ARI measures similarity between the two clusterings

Neil Chowdhury 26

Contingency table (Wikipedia)

ARI formula (Wikipedia)

Page 27: A method to recognize universal patterns in genome ...€¦ · Introduction to Hi-C Neil Chowdhury 2 Hi-C Matrix of Rao et al. (2014) GM12878 genome . Chromatin structure in the cell

NIPBL vs. Untreated (UNTR)

• Mouse liver cells • NIPBL is a cohesion loading protein

thought to be responsible for loop extrusion

Neil Chowdhury 27

Page 28: A method to recognize universal patterns in genome ...€¦ · Introduction to Hi-C Neil Chowdhury 2 Hi-C Matrix of Rao et al. (2014) GM12878 genome . Chromatin structure in the cell

ChromHMM

Neil Chowdhury 28

Specify number of chromatin states

(Wikipedia)

Observations: ChIP-seq tracks

Page 29: A method to recognize universal patterns in genome ...€¦ · Introduction to Hi-C Neil Chowdhury 2 Hi-C Matrix of Rao et al. (2014) GM12878 genome . Chromatin structure in the cell

Agglomerative (single linkage, k=5)

Neil Chowdhury 29

Page 30: A method to recognize universal patterns in genome ...€¦ · Introduction to Hi-C Neil Chowdhury 2 Hi-C Matrix of Rao et al. (2014) GM12878 genome . Chromatin structure in the cell

Comparison with human cells

Neil Chowdhury 30

GM12878 (Human cell line) NIPBL factor removed

Mouse liver

Page 31: A method to recognize universal patterns in genome ...€¦ · Introduction to Hi-C Neil Chowdhury 2 Hi-C Matrix of Rao et al. (2014) GM12878 genome . Chromatin structure in the cell

Effect of Dimensionality Reduction

Neil Chowdhury 31