Director The Cancer Genome Atlas - National Human Genome Research
Human Cancer Genome Project Computational Systems Biology of Cancer: (I)
-
date post
19-Dec-2015 -
Category
Documents
-
view
227 -
download
13
Transcript of Human Cancer Genome Project Computational Systems Biology of Cancer: (I)
Human Cancer Genome Project
Computational Systems Biology of Cancer:
(I)
Human Cancer Genome Project
Bud MishraBud Mishra
Professor of Computer Science, Mathematics and
Cell Biology¦
Courant Institute, NYU School of Medicine, Tata Institute of Fundamental Research, and Mt. Sinai
School of Medicine
Human Cancer Genome Project
Human Cancer Genome Project
War on CancerWar on Cancer
• “Reports that say that something hasn't happened are always interesting to me, because as we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns – the ones we don't know we don't know.”– US Secretary of Defense, Mr. Donald
Rumsfeld, Quoted completely out of context.
Human Cancer Genome Project
Introduction: Cancer and Introduction: Cancer and Genomics:Genomics:
What we know & what we do not
“Cancer is a disease of the genome.”
Human Cancer Genome Project
OutlineOutline
• Genomics• Genome Modification & Repair• Segmental Duplications Models
Human Cancer Genome Project
GenomicsGenomics
• Genome:– Hereditary information of an organism
is encoded in its DNA and enclosed in a cell (unless it is a virus). All the information contained in the DNA of a single organism is its genome.
• DNA molecule can be thought of as a very long sequence of nucleotides or bases:
= {A, T, C, G}
Human Cancer Genome Project
ComplementarityComplementarity• DNA is a double-
stranded polymer and should be thought of as a pair of sequences over .
• However, there is a relation of complementarity between the two sequences: – A , T, C , G
Human Cancer Genome Project
DNA Structure.DNA Structure.
• The four nitrogenous bases of DNA are arranged along the sugar- phosphate backbone in a particular order (the DNA sequence), encoding all genetic instructions for an organism.
• Adenine (A) pairs with thymine (T), while cytosine (C) pairs with guanine (G).
• The two DNA strands are held together by weak bonds between the bases.
Human Cancer Genome Project
Structure and Structure and ComponentsComponents
• Complementary base pairs(A-T and C-G)
• Cytosine and thymine are smaller (lighter) molecules, called pyrimidines
• Guanine and adenine are bigger (bulkier) molecules, called purines.
• Adenine and thymine allow only for double hydrogen bonding, while cytosine and guanine allow for triple hydrogen bonding.
Human Cancer Genome Project
Inert & RigidInert & Rigid
• Thus the chemical (hydrogen bonding) and the mechanical (purine to pyrimidine) constraints on the pairing lead to the complementarity and makes the double stranded DNA both chemically inert and mechanically quite rigid and stable.
Human Cancer Genome Project
The Central DogmaThe Central Dogma• The central dogma(due
to Francis Crick in 1958) states that these information flows are all unidirectional:“The central dogma
states that once `information' has passed into protein it cannot get out again.”
DNA RNA ProteinTranscription Translation
Human Cancer Genome Project
The Central DogmaThe Central Dogma
“…The transfer of information from nucleic acid to nucleic acid, or from nucleic acid to protein, may be possible, but transfer from protein to protein, or from protein to nucleic acid is impossible. Information means here the precise determination of sequence, either of bases in the nucleic acid or of amino acid residues in the protein.”
DNA RNA ProteinTranscription Translation
Human Cancer Genome Project
The New SynthesisThe New Synthesis
DNA RNA ProteinGenome Evolution
Selection
Part-lists, Annotation, Ontologies
Transcription Translation
Gen
oty
pe
Ph
en
oty
pe
Human Cancer Genome Project
Cancer Initiation and Cancer Initiation and ProgressionProgression
Cancer Initiation and Progression
Mutations, Translocations, Amplifications, Deletions
Epigenomics (Hyper & Hypo-Methylation)
Alternate Splicing
Proliferation, Motility, Immortality, Metastasis, Signaling
Human Cancer Genome Project
Multi-step Nature of Multi-step Nature of Cancer:Cancer:
• Cancer is a stepwise process, typically requiring accumulation of mutations in a number of genes.
• ~6-7 independent mutations typically occur over several decades:– Conversion of proto-oncogenes to
oncogenes– Inactivation of tumor suppressor
gene
Human Cancer Genome Project
Amplifications & DeletionsAmplifications & Deletions
Mutation in a TSG
Epigenomics
Conversion of aProto-Oncogene
Deletion of a TSG
Deletion of a TSG
Human Cancer Genome Project
P53 Gene (TSG)P53 Gene (TSG)
Human Cancer Genome Project
The Cancer Genome AtlasThe Cancer Genome Atlas
• Obtain a comprehensive description of the genetic basis of human cancer. – Identify and characterize all the sites
of genomic alteration associated at significant frequency with all major types of cancers.
Human Cancer Genome Project
The Cancer Genome AtlasThe Cancer Genome Atlas
• Increase the effectiveness of research to understand– tumor initiation and progression,– susceptibility to carcinogensis,– development of cancer therapeutics,– approaches for early detection of
tumors &– the design of clinical trials.
Human Cancer Genome Project
Specific GoalsSpecific Goals
• Identify all genomic alterations significantly associated with all major cancer types.
• Such knowledge will propel work by thousands of investigators in cancer biology, epidemiology, diagnostics and therapeutics.
Human Cancer Genome Project
To Achieve this goal …To Achieve this goal …
• Create large collection of appropriate, clinically annotated samples from all major types of cancer; and
• Characterize each sample in terms of:– All regions of genomic loss
or amplification,– All mutations in the coding
regions of all human genes,– All chromosomal
rearrangements,– All regions of aberrant
methylation, and– Complete gene expression
profile, as well as other appropriate technologies.
Human Cancer Genome Project
Biomedical RationaleBiomedical Rationale
• Cancer is a heterogeneous collection of heterogeneous diseases. – For example, prostate cancer can be an
indolent disease remaining dormant throughout life or an aggressive disease leading to death.
– However, we have no clear understanding of why such tumors differ.
Human Cancer Genome Project
Biomedical RationaleBiomedical Rationale
• Cancer is fundamentally a disease of genomic alteration.– Cancer cells typically carry many genomic
alterations that confer on tumors their distinctive abilities (such as the capacity to proliferate and metastasize, ignoring the normal signals that block cellular growth and migration) and liabilities (such as unique dependence on certain cellular pathways, which potentially render them sensitive to certain treatments that spare normal cells).
Human Cancer Genome Project
HistoryHistory
• 1960s– The genetic basis of cancer was clear from cytogenetic
studies that showed consistent translocations associated with specific cancers (notably the so-called Philadelphia chromosome in chronic myelogenous leukemia).
• 1970s– Recognize specific cancer-causing mutations through
recombinant DNA revolution of the 1970s.– The identification of the first vertebrate and human
oncogenes and the first tumor suppressor genes,– These discoveries have elucidated the cellular pathways
governing processes such as cell-cycle progression, cell-death control, signal transduction, cell migration, protein translation, protein degradation and transcription.
• For no human cancer do we have a comprehensive understanding of the events required.
Human Cancer Genome Project
Scientific Foundation for a Human Scientific Foundation for a Human Cancer Genome ProjectCancer Genome Project
• Gene resequencing.– Specific gene classes (such as
kinases and phosphatases) in particular cancer types.
• Epigenetic changes.– Loss of function of tumor
suppressor genes by epigenetic modification of the genome — such as DNA methylation and histone modification.
• Genomic loss and amplification. – Consistent association with
genomic loss or amplification in many specific regions, indicating that these regions harbor key cancer associated genes
• Chromosome rearrangements.– Activate kinase pathways
through fusion proteins or inactivating differentiation programs through gene disruption.
– Hematological malignancies: a single stereotypical translocation in some diseases (such as CML) and as many as 20 important translocations in others (such as AML).
– Adult solid tumors have not been as well characterized, in part owing to technical hurdles.
Human Cancer Genome Project
Human Genome StructureHuman Genome Structure
Human Cancer Genome Project
EBDEBD
• J.B.S. Haldane (1932):– “A redundant duplicate of
a gene may acquire divergent mutations and eventually emerge as a new gene.”
• Susumu Ohno (1970):– “Natural selection merely
modified, while redundancy created.”
Human Cancer Genome Project
Evolution by DuplicationEvolution by Duplication
Genome Evolution
Non-random distribution in Genomic data
Short words frequency Protein family sizeLong range orrelation
Motifs in cellular etworksLeu3
LEU1 BAT1 ILV2
UMP UDP
UTP CTP
Regulatory motifs Metabolic pathways Interaction clusters
Functional modules
Living cells
Organism
?
Human Cancer Genome Project
Human ConditionHuman Condition
Human Cancer Genome Project
Mer-scape…Mer-scape…
•Overlapping words of different sizes are analyzed for their frequencies along the whole human genome
•Red: 24-mers,• Green: 21-mers• Blue:18 mers• Gray:15 mers
• To the very left is a ubiquitous human transposon Alu. The high frequency is indicative of its repetitive nature.•To the very right is the beginning of a gene. The low frequency is indicative of its uniqueness in the whole genome.
Human Cancer Genome Project
Doublet RepeatsDoublet Repeats
• Serendipitous discovery of a new uncataloged class of short duplicate sequences; doublet repeats.– almost always < 100 bp
• (Top) . The distance between the two loci of a doublet is plotted versus the chromosomal position of the first locus.
• (Bottom) : Distribution of doublets (black) and segmental duplications (red) across human
chromosome 2 0.00 57500000.00 115000000.00 172500000.00 230000000.00
Start of 10 Mb window on chromosome 2
0
200
400
600
800
Nu
mb
er
of
ele
me
nts
in
10
Mb
win
do
w
Human Cancer Genome Project
Segmental DuplicationsSegmental Duplications
• 3.5% ~ 5% of the human genome is found to contain– segmental duplications,
with length > 5 or 1kb, identity > 90%.
• These duplications are estimated to have emerged about 40Mya under neutral assumption.
• The duplications are mostly interspersed (non-tandem), and happen both inter- and intra-chromosomally.
From [Bailey, et al. 2002]
Human
Human Cancer Genome Project
The ModelThe Model
Duplication by recombination between repeats
Duplication by recombination between other repeats or other mechanisms
insertion
insertion deletion or
mutation
deletion or mutation
f - -
f ++
f + -
Mutation accumulation in the duplicated sequences
f - -
f ++
f + -
Human Cancer Genome Project
The Mathematical ModelThe Mathematical Model
0 ≤ d < ε ε ≤ d < 2ε (k-1)ε ≤ d < kε
H0
H1
α1-α-2β
1-α-2γ
1-α-β/2-γ
α
α
α
α
α
α
α
α
γ 2β
2γ β/2 2γ β/2 2γ β/2
γ 2β γ 2β
1-α-2β 1-α-2β
1-α-β/2-γ 1-α-β/2-γ
1-α-2γ 1-α-2γ
h0
h1 h1++
h1--
h0+-
h0--
h0++
f - -
f ++
f + -
α
α
α
Time after duplication
h1: proportion of duplications by repeat recombination;
h1++: proportion of duplications by recombination of the specific repeat;
h1- - : proportion of duplications by recombination of other repeats;
h0: proportion of duplications by other repeat-unrelated mechanism;
h0++: proportion of h0 with common specific repeat in the flanking regions;
h0+-: proportion of h0 with no common specific repeat in the flanking regions;
h0- -: proportion of h0 with no specific repeat in the flanking regions;
α: mutation rate in duplicated sequences;
β: insertion rate of the specific repeat;
γ: mutation rate in the specific repeat;
d: divergence level of duplications;
ε: divergence interval of duplications.
Human Cancer Genome Project
Model FittingModel Fitting
Diversity:
f - -
f ++
f + -
Alu
Diversity:
f - -
f ++
f + -
L1
•The model parameters (αAlu, βAlu, γAlu, αL1, βL1, γL1) are estimated from the reported mutation and insertion rates in the literature.
•The relative strengths of the alternative hypotheses can be estimated by model fitting to the real data.
h1Alu ≈ 0.76; h1++
Alu ≈ 0.3; h1L1 ≈ 0.76; h1++ L1 ≈ 0.35.
•The model parameters (αAlu, βAlu, γAlu, αL1, βL1, γL1) are estimated from the reported mutation and insertion rates in the literature.
•The relative strengths of the alternative hypotheses can be estimated by model fitting to the real data.
h1Alu ≈ 0.76; h1++
Alu ≈ 0.3; h1L1 ≈ 0.76; h1++ L1 ≈ 0.35.
Human Cancer Genome Project
Polya’s UrnPolya’s Urn
Random drawn
Duplication
Put back
Human Cancer Genome Project
Repetitive Random Eccentric Repetitive Random Eccentric GODGOD
• Genome Organizing Devices (GOD)• Polya’s Urn Model:
• F’s: functions deciding probability distributions
F1 (decide an initial position)
F2 (decide selected length)
F3 (decide copy number)
k copies
F4 (decide insertion positions)
Human Cancer Genome Project
• Insertion (duplication):
3’ realignment and polymerase
reloading
normal replication
polymerase pausing and dissociation
• Deletion:
polymerase pausing and dissociation
normal replication
DNA polymerase stutteringDNA polymerase stuttering(replication slippage)(replication slippage)
DNA polymerase
3’ realignment and polymerase reloading
Human Cancer Genome Project
DNA is repaired-resulting in a duplication of the transposon and target site
Transposon inserted
Target DNA
TransposonsTransposons~~ causes: deletions and duplications causes: deletions and duplications
•Deletion:
•Duplication:
transposon
IS
IS
•Transposon actions ingenomic DNA:
transposon
Transposon looping out
Transposon deleted
Donor DNA
DNA intermediates
Transposase cuts in target DNA
Human Cancer Genome Project
DNA mismatch repair DNA mismatch repair mechanismmechanism
~ prevents duplications and deletions~ prevents duplications and deletions
MSH2 MSH6
MLH1 PMS2
Exo
MSH6MSH2
PMS2
MLH1
MSH6
MSH2
DNA polymerase
Mismatch recognition on
daughter strand
DNA -loop formation by translocation through the
proteins
Degradation of the
mismatched daughter strand in the -loop
Refilling the gap by DNA polymerase Corrected
daughter strand
Human Cancer Genome Project
Graph ModelGraph Model
Graph description•G = {V, E}, G is a directed multi-graph;
•V {Vi, all n-mer’s, i = 1…4n}, ;
•E { (Vi , Vj ), when Vi represents the n-mer immdiately upstream of the n-mer represented by Vj in the genomic sequence};
•ki = incoming (or outgoing) degree of node i (Vi) = copy number of the n-mer represented by Vi;
•During the graph evolution, at each iteration, one of the following happens: deletion (with probability p0), duplication (with probability p1) or substitution (with probability q). p0 + p1 + q = 1.
•For an arbitrary node Vi , the probabilities of one of the above events happens is as follows:
C. Substitution: With probability q
A. Deletion: With probability p0
B. Duplication:With probability p1
;)dsubstitute is (
1
N
jj
i
i
k
kqVP
;1
1)ssubstitute (
NqVP
i
;)deleted is (
1
0
N
jj
i
i
k
kpVP
;)duplicated (
1
1
N
jj
i
i
k
kpVP
Human Cancer Genome Project
Model FittingModel Fitting
Real distribution from genome analysis
Expected distribution from random sequences
Model fitting results
Human Cancer Genome Project
Model Fitted ParameterModel Fitted Parameter
Mer size 6 mer 7 mer 8 mer 9 mer
q p1/p0 q p1/p0 q p1/p0 q p1/p0
M. genitalium 0.0176 1.1 0.0436 1.1 0.1587 1.5 0.3222 1.5
M. pneumoniae
0.0319 1.1 0.1151 1.5 0.2309 1.5 0.4363 1.5
P. abyssi 0.0269 1.1 0.0672 1.2 0.1778 1.4 0.3897 1.5
P. horikoshii 0.0234 1.1 0.0443 1.1 0.1390 1.3 0.3456 1.5
P. furiosus 0.0213 1.1 0.0384 1.1 0.1119 1.2 0.3114 1.5
H. pylori 0.0180 1.1 0.0320 1.1 0.0925 1.3 0.2262 1.5
H. influenzae 0.0202 1.1 0.0366 1.1 0.1364 1.5 0.2802 1.5
S. tokodaii 0.0180 1.1 0.0320 1.1 0.0925 1.3 0.2262 1.5
S. subtilis 0.0187 1.1 0.0326 1.1 0.1139 1.4 0.2585 1.5
E. coli K12 0.0207 1.1 0.0334 1.1 0.0698 1.1 0.2389 1.5
S. cerevisiae 0.0113 1.1 0.0176 1.1 0.0459 1.2 0.1311 1.4
C.elegans -- -- 0.0076 1.1 0.0115 1.1 0.0275 1.2
•The substitution rate, q, increases with the sizes of mer’s.
•The ratio between duplication and deletion rate, p1/p0, increases with sizes of mer’s.
•The substitution rate, q, tends to decrease when the genome sizes are larger. Especially, q is much smaller in eukaryotic genomes than in prokaryotic genomes.
•The substitution rate, q, increases with the sizes of mer’s.
•The ratio between duplication and deletion rate, p1/p0, increases with sizes of mer’s.
•The substitution rate, q, tends to decrease when the genome sizes are larger. Especially, q is much smaller in eukaryotic genomes than in prokaryotic genomes.
Human Cancer Genome Project
J.B.S. HaldaneJ.B.S. Haldane
• “If I were compelled to give my own appreciation of the evolutionary process…, I should say this: In the first place it is very beautiful. In that beauty, there is an element of tragedy…In an evolutionary line rising from simplicity to complexity, then often falling back to an apparently primitive condition before its end, we perceive an artistic unity …
• “To me at least the beauty of evolution is far more striking than its purpose.”
• J.B.S. Haldane, The Causes of Evolution. 1932.
Human Cancer Genome Project
Human Cancer GenomeHuman Cancer Genome
Human Cancer Genome Project
CancerCancer
Normal epithelial mucosaNeoplastic polyp
Human Cancer Genome Project
A ChallengeA Challenge
• “At present, description of a recently diagnosed tumor in terms of its underlying genetic lesions remains a distant prospect. Nonetheless, we look ahead 10 or 20 years to the time when the diagnosis of all somatically acquired lesions present in a tumor cell genome will become a routine procedure.”
– Douglas Hanahan and Robert Weinberg
• Cell, Vol. 100, 57-70, 7 Jan 2000
Human Cancer Genome Project
KaryotypingKaryotyping
Human Cancer Genome Project
CGH:CGH:Comparative Genomic Hybridization.Comparative Genomic Hybridization.
• Equal amounts of biotin-labeled tumor DNA and digoxigenin-labeled normal reference DNA are hybridized to normal metaphase chromosomes
• The tumor DNA is visualized with fluorescein and the normal DNA with rhodamine
• The signal intensities of the different fluorochromes are quantitated along the single chromosomes
• The over-and underrepresented DNA segments are quantified by computation of tumor/normal ratio images and average ratio profiles
Amplification
Deletion
Human Cancer Genome Project
CGH: Comparative Genomic CGH: Comparative Genomic Hybridization.Hybridization.
Human Cancer Genome Project
Microarray Analysis of Cancer Microarray Analysis of Cancer GenomeGenome
• Representations are reproducible samplings of DNA populations in which the resulting DNA has a new format and reduced complexity.– We array probes derived from low
complexity representations of the normal genome
– We measure differences in gene copy number between samples ratiometrically
– Since representations have a lower nucleotide complexity than total genomic DNA, we obtain a stronger specific hybridization signal relative to non-specific and noise
Normal DNA
Normal LCR
Tumor DNA
Tumor LCR
Label
Hybridize
Human Cancer Genome Project
A1
A2
A3
B1
B2
B3
C1
C2
C3
Copy Number FluctuationCopy Number Fluctuation
Human Cancer Genome Project
Measuring gene copy number differences Measuring gene copy number differences between complex genomesbetween complex genomes
• Compare the genomes of diseased and normal samples
• Error Control:– The use of representations
augmenting microarrays– Representations
reproducibly sample the genome thereby reducing its complexity. This increases the signal-to-noise ratio and improves sensitivity
– Statistical Modeling the sources of Noise
– Bayesian Analysis
Hybridize to 500,000 features
microarray
Genomic DNA patient samples
Statsitical Analysis
Human Cancer Genome Project
Nattering Nabobs of Nattering Nabobs of NegativismNegativism
• Some scientists are concerned about the cost and the possibility that a project of this scale could take money away from smaller ones…
• Craig Venter, who led a private project to determine the human DNA blueprint in competition with the human genome project, said it would make more sense to look at specific families of genes known to be involved in cancer.
• Lee Hood, president of the Institute for Systems Biology, has called the premise of the Cancer Genome Project “naïve,” suggesting that signal-to-noise issues its researchers are likely to encounter will be “absolutely enormous.”
Human Cancer Genome Project
ChallengesChallenges
• ArrayCGH data is noisy!– Better computational biology algorithms– Better statistical modeling…In some technologies, the
noise is systematic (e.g., affy SNP-chips)– Better bio-technologies (GRIN: Genomics, Robotics,
Informatics, Nanotechnology)• No efficient technology for epigenomics,
translocations or de novo mutations.• Algorithms for multi-locus association studies• Systems view of cancer by integrating data from
multiple sources:– Genomic, Epigenomic, Transcriptomic, Metabolomic &
Proteomic.– Regulatory, Metabolic and Signaling pathways
Human Cancer Genome Project
Mishra’s Mystical 3M’sMishra’s Mystical 3M’s
• Rapid and accurate solutions – Bioinformatic,
statistical, systems, and computational approaches.
– Approaches that are scalable, agnostic to technologies, and widely applicable
• Promises, challenges and obstacles—
Measure
Mine
Model
Human Cancer Genome Project
DiscussionsDiscussions
Q&A…
Human Cancer Genome Project
Answer to CancerAnswer to Cancer
• “If I know the answer I'll tell you the answer, and if I don't, I'll just respond, cleverly.”– US Secretary of Defense, Mr.
Donald Rumsfeld.
Human Cancer Genome Project
To be continued…To be continued…
Break…