Bioc4010 lectures 1 and 2

Next-Generation Sequence Analysis for Biomedical Applications BIOC 4010/5010 Lecture 1 Dr. Dan Gaston Postdoctoral Fellow Department of Pathology Dr. Karen Bedard Lab Bioinformatician, IGNITE Project


Introduction to NGS and Bioinformatics for Human Disease Applications

Transcript of Bioc4010 lectures 1 and 2

Page 1: Bioc4010 lectures 1 and 2

Next-Generation Sequence Analysis for Biomedical Applications

BIOC 4010/5010 Lecture 1

Dr. Dan Gaston
Postdoctoral Fellow, Department of Pathology

Dr. Karen Bedard Lab
Bioinformatician, IGNITE Project

Page 2: Bioc4010 lectures 1 and 2

LECTURE 1: Introduction to Next-Gen Sequencing

Page 3: Bioc4010 lectures 1 and 2

Overview: Lecture 1

• Introduction AKA “Why does this matter?”
• “Next-Gen” Sequencing
• Bioinformatics Workflows
• Types of Next-Gen Experiments
• Working with the Human Genome
• Slides available on slideshare:
  – http://www.slideshare.net/DanGaston

Page 4: Bioc4010 lectures 1 and 2

Major Areas in Human Disease Genomics

• Complex diseases
  – Genome Wide Association Studies (GWAS)
• Cancer
  – Tumour genomics (Driver mutations)
  – Transcriptomics
• Mendelian disease
  – Whole Genome/Exome Sequencing
  – Transcriptomics
  – Genetic Linkage

Page 5: Bioc4010 lectures 1 and 2

Diagnosing Genetic Diseases

• Genetic Counselors/Physicians order individual testing of genes based on patient phenotype

• For rare diseases or unusual phenotypes may run tens to hundreds of tests

• …..EXPENSIVE (Easily thousands of dollars)

Page 6: Bioc4010 lectures 1 and 2

Genetic Disease Research

Page 7: Bioc4010 lectures 1 and 2

Genetic Disease Research: Charcot-Marie-Tooth

Chromosome 9:120,962,282 -133,033,431

Page 8: Bioc4010 lectures 1 and 2

Charcot-Marie-Tooth

• Linked genomic region ~13 Mb in size
• Contains 143 genes
• Prioritize and select genes for individual Sanger sequencing
• …Slow
• …Laborious
• …Can be expensive

Page 9: Bioc4010 lectures 1 and 2
Page 10: Bioc4010 lectures 1 and 2

Personalized Medicine

Page 11: Bioc4010 lectures 1 and 2

Human Genomics

• $5,000 - $10,000 to sequence whole genome
• $1,000 to sequence only protein-coding portion (exome, later)

Page 12: Bioc4010 lectures 1 and 2

Clinical Genomics

• Rapid diagnosis of genetic disease in NICU cases
• Quicker and cheaper than sequential genetic testing (traditional method)

Page 13: Bioc4010 lectures 1 and 2

Cancer Genomics

Welch JS, et al. JAMA, 2011;305, 1577

Page 14: Bioc4010 lectures 1 and 2

Cancer Chemotherapy Resistance

Page 15: Bioc4010 lectures 1 and 2

Human Disease Genomics at Dalhousie

• IGNITE: Identifying genetic mutations causing rare Mendelian diseases in Atlantic Canada
  – 3 year, $2.5 million Genome Canada Project
  – Currently working on >10 different diseases including two inherited cancers
  – Sequenced >20 individual exomes, 4 whole genomes, and several transcriptomes
  – More on Thursday…

• Dr. Graham Dellaire: Transcriptome sequencing and analysis on multiple cancer cell lines

Page 16: Bioc4010 lectures 1 and 2
Page 17: Bioc4010 lectures 1 and 2
Page 18: Bioc4010 lectures 1 and 2

Short Reads

Millions of paired “short reads”, 75-150bp each

Page 19: Bioc4010 lectures 1 and 2

FastQ Format

Read ID

Sequence

Quality line
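The three labelled parts of a FastQ record (read ID, sequence, quality line) can be pulled apart with a few lines of code. This is a minimal sketch, assuming the common four-line layout in which a "+" separator line sits between the sequence and the quality line; the names `FastqRecord` and `parse_fastq` are made up for illustration.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    read_id: str
    sequence: str
    quality: str


def parse_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or trailing blank line)
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line is skipped
        quality = handle.readline().rstrip("\n")
        # One quality character is expected per base.
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Hypothetical single-read example.
example = io.StringIO("@read1\nGATTACA\n+\nIIIIII#\n")
records = list(parse_fastq(example))
```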

Page 20: Bioc4010 lectures 1 and 2

FastQ Quality Scores

Quality Score (Q)   Probability of incorrect base call   Base call accuracy
10                  1 in 10                              90%
20                  1 in 100                             99%
30                  1 in 1,000                           99.9%
40                  1 in 10,000                          99.99%
50                  1 in 100,000                         99.999%

Q = -10 log10 P
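The relationship Q = -10 log10 P can be checked numerically. The sketch below converts in both directions and decodes a quality string; the ASCII offset of 33 is an assumption about the file encoding (it is the common Sanger-style offset), and the function names are illustrative only.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """P = 10^(-Q/10): probability that the base call is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Q = -10 log10 P."""
    return -10 * math.log10(p)


def decode_quality_string(quality: str, offset: int = 33) -> list:
    """Turn the FastQ quality line into integer Q scores.

    offset=33 is an assumption; older files may use a different offset.
    """
    return [ord(ch) - offset for ch in quality]
```

For example, `phred_to_error_prob(30)` gives a 1 in 1,000 error probability, matching the Q30 row of the table.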

Page 21: Bioc4010 lectures 1 and 2

Quality Scores of Sequencing Reads

Page 22: Bioc4010 lectures 1 and 2

General Genomics Workflow

Raw Data Analysis: Quality control of raw data

Whole Genome Mapping: Alignment to reference genome

Variant Calling: Detection of genetic variation (SNPs, Indels, SV)

Annotation: Linking variants to biological information

Page 23: Bioc4010 lectures 1 and 2

Short Read Mapping

[Figure: short reads tiled along the reference sequence …CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGTATAC…, each read placed where it matches best]

1) Report location of genome where read matches best
2) Minimize mismatches
3) Mismatches with lower quality bases better than mismatches with higher quality bases

Page 24: Bioc4010 lectures 1 and 2

Discovering Genetic Variation

[Figure: reads aligned to the reference genome, illustrating SNPs (an A/T difference supported by several reads) and INDELs (gaps within aligned reads)]

Reference genome: ATCCTGATTCGGTGAACGTTATCGACGATCCGATCGAACTGTCAGCGGCAAGCTGATCGATCGATCGATGCTAGTG

Page 25: Bioc4010 lectures 1 and 2

Next-Gen Sequencing Experiments

• Whole Genome Sequencing
• Targeted Exome Sequencing
• RNA-Seq
• ChIP-Seq
• CLIP-Seq

Page 26: Bioc4010 lectures 1 and 2

Next-Gen Sequencing Experiments

• Whole Genome Sequencing
• Targeted Exome Sequencing
• RNA-Seq
• ChIP-Seq
• CLIP-Seq

Page 27: Bioc4010 lectures 1 and 2

Composition of Human Genome

Size: 3.2 Gb

Page 28: Bioc4010 lectures 1 and 2

Genomic Content

Chromosome  Base pairs   Variations  Confirmed proteins  Putative proteins  Pseudogenes  miRNA  rRNA  Misc ncRNA
1           249,250,621  4,401,091   2,012               31                 1,130        134    66    106
2           243,199,373  4,607,702   1,203               50                 948          115    40    93
3           198,022,430  3,894,345   1,040               25                 719          99     29    77
4           191,154,276  3,673,892   718                 39                 698          92     24    71
5           180,915,260  3,436,667   849                 24                 676          83     25    68
6           171,115,067  3,360,890   1,002               39                 731          81     26    67
7           159,138,663  3,045,992   866                 34                 803          90     24    70
8           146,364,022  2,890,692   659                 39                 568          80     28    42
9           141,213,431  2,581,827   785                 15                 714          69     19    55
10          135,534,747  2,609,802   745                 18                 500          64     32    56
11          135,006,516  2,607,254   1,258               48                 775          63     24    53
12          133,851,895  2,482,194   1,003               47                 582          72     27    69
13          115,169,878  1,814,242   318                 8                  323          42     16    36
14          107,349,540  1,712,799   601                 50                 472          92     10    46
15          102,531,392  1,577,346   562                 43                 473          78     13    39
16          90,354,753   1,747,136   805                 65                 429          52     32    34
17          81,195,210   1,491,841   1,158               44                 300          61     15    46
18          78,077,248   1,448,602   268                 20                 59           32     13    25
19          59,128,983   1,171,356   1,399               26                 181          110    13    15
20          63,025,520   1,206,753   533                 13                 213          57     15    34
21          48,129,895   787,784     225                 8                  150          16     5     8
22          51,304,566   745,778     431                 21                 308          31     5     23
X           155,270,560  2,174,952   815                 23                 780          128    22    52
Y           59,373,566   286,812     45                  8                  327          15     7     2
mtDNA       16,569       929         13                  0                  0            0      2     22

Page 29: Bioc4010 lectures 1 and 2

Exome Sequencing

Page 30: Bioc4010 lectures 1 and 2

Transcriptomics: RNA-Seq

• Sequence the actively transcribed genes in a cell line or tissue
  – Only about 20% of genes are transcribed in particular cell types
• Two types:
  – Poly-A selection
  – Total RNA + ribodepletion
• Many experimental questions can be addressed

Page 31: Bioc4010 lectures 1 and 2

RNA-Seq: Gene Expression

Condition 1

Condition 2

Page 32: Bioc4010 lectures 1 and 2

RNA-Seq: Differential Splicing

Exon1 Exon 2 Exon 3

Page 33: Bioc4010 lectures 1 and 2

RNA-Seq: Novel/Non-Canonical Exon Discovery

Exon 1 Exon 2 Exon 3 Exon X

Page 34: Bioc4010 lectures 1 and 2

RNA-Seq: Gene Fusion Events

Exon1 Exon 2 Exon 3

Gene 2 Exon 4

Page 35: Bioc4010 lectures 1 and 2

RNA-Seq

• Important to take into account biological variability. A sample of cells is a mixed population
  – Replicates!
• Not suited for discovering polymorphisms due to higher error rates introduced by reverse transcription step (RNA -> cDNA)
• High false positive rates for fusion gene and novel exon discovery when expression levels are low

Page 36: Bioc4010 lectures 1 and 2

ChIP-Seq

Page 37: Bioc4010 lectures 1 and 2

ChIP-Seq

Page 38: Bioc4010 lectures 1 and 2

Short Read Mapping: Placing Millions of Reads on Human Reference

• Problem: Efficiently place millions of reads (75bp – 200bp) accurately within 3.2Gb of reference genome

• Problem: Read may match equally well at more than one location (pseudogenes, copy number variation, repetitive elements)

• Problem: Sequencing reads may be paired

Page 39: Bioc4010 lectures 1 and 2

Short Read Mapping: Brute Force Method

Simple conceptually: Compare each query k-mer to all k-mers of genome

Genome Size (N): 3.2 billion bases
K-mer length (M): 7
Number of comparisons ((N - M + 1) * M): ~22 billion
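The brute-force idea can be sketched directly: slide the query along the genome and compare every base at every position. The function below is illustrative only (the name `brute_force_hits` and the `max_mismatches` parameter are not from the slides); it deliberately has no early exit so that the comparison count equals (N - M + 1) * M. For N = 3.2 billion and M = 7 that product is roughly 22.4 billion.

```python
def brute_force_hits(genome: str, query: str, max_mismatches: int = 0):
    """Compare the query against every window of the genome.

    Returns (start positions with at most max_mismatches, total base comparisons).
    """
    n, m = len(genome), len(query)
    hits = []
    comparisons = 0
    for start in range(n - m + 1):
        mismatches = 0
        for offset in range(m):
            comparisons += 1  # one base-to-base comparison
            if genome[start + offset] != query[offset]:
                mismatches += 1
        if mismatches <= max_mismatches:
            hits.append(start)
    return hits, comparisons


# Toy reference taken from the short read mapping slide.
genome = "CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGTATAC"
hits, comparisons = brute_force_hits(genome, "GCGCCCTA")
```

On a 40-base toy genome this is instant; the same loop over a human-sized reference, repeated for millions of reads, is what makes an index necessary.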

Page 40: Bioc4010 lectures 1 and 2

Solution

Index the Reference Genome

Indexing the reference is like constructing a phone book: you can quickly move towards the relevant portion of the genome and ignore the rest.

Page 41: Bioc4010 lectures 1 and 2

Short Read Alignment: Suffix Array

Split genome into all suffixes (substrings that run to the end of the sequence) and sort alphabetically

Allows query to be searched against an alphabetical reference, skipping 96% of the genome

Ex: banana

Suffixes: banana, anana, nana, ana, na, a

Sorted: a, ana, anana, banana, na, nana
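A suffix array stores only the start positions of the sorted suffixes, not the suffixes themselves. The sketch below builds one the naive way (sorting full suffix strings), which is fine for "banana" but far too slow and memory-hungry for a real genome; the function name is illustrative.

```python
def build_suffix_array(text: str) -> list:
    """Return suffix start positions, ordered by the suffix they begin."""
    return sorted(range(len(text)), key=lambda i: text[i:])


suffix_array = build_suffix_array("banana")
sorted_suffixes = ["banana"[i:] for i in suffix_array]
```

Here `sorted_suffixes` is `a, ana, anana, banana, na, nana`, and `suffix_array` holds the matching start positions `5, 3, 1, 0, 4, 2`.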

Page 42: Bioc4010 lectures 1 and 2

Short Read Alignment: Binary Search

• Searching the index efficiently is still a problem…

Search for GATTACA…

Page 43: Bioc4010 lectures 1 and 2

Short Read Alignment: Binary Search

• Searching the index efficiently is still a problem…

Search for GATTACA…

Page 44: Bioc4010 lectures 1 and 2

Short Read Alignment: Binary Search

• Searching the index efficiently is still a problem…

Search for GATTACA…

Page 45: Bioc4010 lectures 1 and 2

Short Read Alignment: Binary Search

• Searching the index efficiently is still a problem…

Search for GATTACA…

Page 46: Bioc4010 lectures 1 and 2

Short Read Alignment: Binary Search

• Searching the index efficiently is still a problem…

Search for GATTACA…

Page 47: Bioc4010 lectures 1 and 2

Binary Search

• Initialize search range to entire list
  – mid = (hi+lo)/2; middle = suffix[mid]
  – if query matches middle: done
  – else if query < middle: pick low range
  – else if query > middle: pick hi range
• Repeat until done or empty range
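The binary search steps translate almost line for line into code. This is a sketch under simple assumptions: the suffix array is built naively, the query must match exactly, and "query matches middle" is taken to mean that the query equals the first len(query) characters of the middle suffix. Function names are illustrative.

```python
def build_suffix_array(text: str) -> list:
    """Naive construction, adequate for small examples only."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrence(text: str, suffix_array: list, query: str):
    """Binary search for one exact occurrence of query; None if absent."""
    lo, hi = 0, len(suffix_array) - 1
    while lo <= hi:  # stop on an empty range
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        middle = text[start:start + len(query)]
        if query == middle:
            return start
        if query < middle:
            hi = mid - 1  # keep the low range
        else:
            lo = mid + 1  # keep the high range
    return None


text = "TTGATTACAGG"
sa = build_suffix_array(text)
position = find_occurrence(text, sa, "GATTACA")
```

Each step halves the range, so the number of suffix comparisons grows with the logarithm of the genome length rather than with the length itself.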

Page 48: Bioc4010 lectures 1 and 2

Applied to Human Genome

• In practice simple methods of indexing the genome can create very large data structures
  – Suffix Array: > 12 GB
• Solution: Apply complex procedures that allow you to index and compress the data:
  – Burrows-Wheeler Transform
  – FM-Index
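To give a feel for the Burrows-Wheeler Transform, here is a toy version built from sorted rotations, together with its naive inverse. This is a teaching sketch, not how aligners implement it: real tools derive the transform from a suffix array and add FM-index lookup tables. The "$" terminator is an assumed end-of-text marker that must not occur in the input.

```python
def bwt(text: str, terminator: str = "$") -> str:
    """Last column of the sorted rotations of text + terminator."""
    if terminator in text:
        raise ValueError("terminator must not occur in the text")
    s = text + terminator
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed: str, terminator: str = "$") -> str:
    """Rebuild the original text by repeatedly prepending and sorting."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(ch + row for ch, row in zip(transformed, table))
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]


transformed = bwt("banana")
```

The transform is reversible and tends to group identical characters into runs, which is what makes the indexed genome compressible.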

Page 49: Bioc4010 lectures 1 and 2

Short Read Mapping: Mapping Quality

• Have also ignored quality scores of reads
• Mapping Quality (for a read): Sum the quality scores at mismatched bases for alignment (SUM_BASE_Q(best)), also consider all other possible alignments

MQ = -log10(1 - 10^(-SUM_BASE_Q(best)) / SUM_i(10^(-SUM_BASE_Q(i))))
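The mapping quality formula can be evaluated directly once each candidate alignment has its SUM_BASE_Q value. The sketch below follows the formula as written on the slide and makes two assumptions that are not in the source: values are rescaled against the best alignment to avoid floating-point underflow (this does not change the ratio), and the result is capped at an arbitrary `cap` when the read has a single candidate location, where the formula would otherwise be infinite.

```python
import math


def mapping_quality(sum_base_q: list, cap: float = 60.0) -> float:
    """MQ = -log10(1 - 10^(-S_best) / sum_i 10^(-S_i)).

    sum_base_q holds SUM_BASE_Q for every candidate alignment of one read;
    the best alignment is the one with the smallest value.
    cap is a hypothetical ceiling, not taken from the slides.
    """
    best = min(sum_base_q)
    # Rescale so the best alignment has weight 1.
    weights = [10 ** (-(s - best)) for s in sum_base_q]
    p_wrong = 1 - 1 / sum(weights)
    if p_wrong <= 0:
        return cap
    return min(cap, -math.log10(p_wrong))
```

A read that aligns equally well in two places has a 50% chance of being misplaced and so gets a mapping quality close to zero, while a read with one clearly best location scores high.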

Page 50: Bioc4010 lectures 1 and 2

Short Read Aligners

• BLAT: BLAST-Like Alignment Tool
• MAQ: First to take into account quality scores
• BWA: First to use Burrows-Wheeler Transform
• Bowtie: Ungapped alignment only
• Bowtie2: Allows indels
• … and many more

Page 51: Bioc4010 lectures 1 and 2

LECTURE 2: Identifying and Annotating Genomic Variation for Disease Gene Discovery

Page 52: Bioc4010 lectures 1 and 2

Genetic Variation

• dbSNP (NCBI) catalogues > 53 million Single Nucleotide Variations (SNVs) in humans
  – 38 million validated
  – 22 million in genes
  – 36 million with frequencies
• 50-80% of mutations involved in inherited disease are caused by SNVs

Page 53: Bioc4010 lectures 1 and 2

SNP vs SNV

• Technically a polymorphism is a variation that doesn’t cause disease and is common in a population

• What is common?
  – Greater than 5% in a population is a typical definition
  – Definition for rare ranges from < 0.5% to < 1.5%

Page 54: Bioc4010 lectures 1 and 2

Discovering Genetic Variation

[Figure: reads aligned to the reference genome, illustrating SNPs (an A/T difference supported by several reads) and INDELs (gaps within aligned reads)]

Reference genome: ATCCTGATTCGGTGAACGTTATCGACGATCCGATCGAACTGTCAGCGGCAAGCTGATCGATCGATCGATGCTAGTG

Page 55: Bioc4010 lectures 1 and 2

Variant Calling: The Absurdly Simple Way

[Figure: ten reads stacked over one position of the reference genome; some reads carry A at that position (matching the reference) and others carry T]

Reference genome: ATCCTGATTCGGTGAACGTTATCGACGATCCGATCGA

Read depth at base: 10 T: 4 A: 6

Genotype: Heterozygous A/T

Page 56: Bioc4010 lectures 1 and 2

Variant Calling: The Absurdly Simple Way

• Algorithm:
  – Count all aligned bases that pass quality threshold (e.g. >Q20)
  – If #reads with alternative base > lower bound (20%) and < upper bound (80%) call heterozygous alt
  – Else if > upper bound call homozygous alternative
  – Else call homozygous reference
• …But what about using base qualities for more than deciding which reads to keep?
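The counting algorithm fits in a few lines. This sketch takes the pileup at one position as (base, quality) pairs and applies the thresholds from the slide literally; the function name, the pileup representation and the "no call" outcome for positions with no usable bases are assumptions made for illustration.

```python
def call_genotype(pileup, ref_base, min_base_q=20, lower=0.2, upper=0.8):
    """Threshold-based genotype call at a single position.

    pileup: list of (base, base_quality) tuples for reads covering the site.
    """
    # Keep only bases that pass the quality threshold (> Q20 by default).
    kept = [base for base, quality in pileup if quality > min_base_q]
    if not kept:
        return "no call"
    alt_fraction = sum(1 for base in kept if base != ref_base) / len(kept)
    if lower < alt_fraction < upper:
        return "heterozygous"
    if alt_fraction > upper:
        return "homozygous alternative"
    return "homozygous reference"


# The worked example from the previous slide: depth 10, A: 6, T: 4.
pileup = [("A", 30)] * 6 + [("T", 30)] * 4
genotype = call_genotype(pileup, ref_base="A")
```

With 4 of 10 reads showing T, the alternative fraction is 40%, which falls between the bounds and gives a heterozygous A/T call.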

Page 57: Bioc4010 lectures 1 and 2

Improving Variant Calling

• MAQ (Mapping and Assembling with Quality):
  – Short Read Mapper and Genotype Caller
  – First to use base qualities for either
  – Introduced mapping quality

Page 58: Bioc4010 lectures 1 and 2

Improving Variant Calling

① Base quality cannot be more reliable than mapping quality of read

② At most an individual can have two real nucleotides at a position (two alleles)

① Only consider two most frequent nucleotides
② Simplify to two states: A and B

Page 59: Bioc4010 lectures 1 and 2

Improving Variant Calling

• Three Possible Genotypes:– AA, BB, AB

• Construct a model that includes base quality to estimate the probability of error

• Calculate the probability of each genotype given the data and error rate

• Genotype with highest probability is called

Page 60: Bioc4010 lectures 1 and 2

The Model

Page 61: Bioc4010 lectures 1 and 2

The Model

m = ploidy (2)
k = number of reads
g = genotype
e = error probability

Page 62: Bioc4010 lectures 1 and 2

The Model

Reads that match reference

Page 63: Bioc4010 lectures 1 and 2

The Model

Reads that don’t match reference
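The model's formula itself is an image and is missing from this transcript, so the following is only a sketch of one likelihood form commonly used for this kind of two-state model, written with the slide's symbols (m = ploidy, k = number of reads, g = genotype, e = error probability). It assumes g counts the reference alleles (0, 1 or 2 for a diploid) and that every read has an error probability strictly between 0 and 1.

```python
import math


def genotype_log_likelihoods(ref_errors, alt_errors, ploidy=2):
    """Log-likelihood of the data for each g = number of reference alleles.

    ref_errors: error probabilities of reads that match the reference
    alt_errors: error probabilities of reads that do not match the reference
    """
    m = ploidy
    result = {}
    for g in range(m + 1):
        log_l = 0.0
        for e in ref_errors:  # reads that match reference
            log_l += math.log(((m - g) * e + g * (1 - e)) / m)
        for e in alt_errors:  # reads that do not match reference
            log_l += math.log(((m - g) * (1 - e) + g * e) / m)
        result[g] = log_l
    return result


# Depth 10 with 6 reference and 4 alternative reads, all at Q30 (e = 0.001).
likelihoods = genotype_log_likelihoods([0.001] * 6, [0.001] * 4)
best_genotype = max(likelihoods, key=likelihoods.get)
```

Here `best_genotype` is 1 (one reference allele, i.e. heterozygous): explaining four high-quality alternative reads as errors is far less likely than a heterozygous site.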

Page 64: Bioc4010 lectures 1 and 2

Improving Variant Calling

• Two widely used tool sets for calling variants
  – samtools (uses MAQ-type calculation)
  – Genome Analysis Toolkit (GATK) UnifiedGenotyper
• UnifiedGenotyper: Capable of calling both indels and single nucleotide polymorphisms (SNPs) and allele frequencies given multiple samples

Page 65: Bioc4010 lectures 1 and 2

UnifiedGenotyper

Apply filters to discard poor reads and remove biases:
① Duplicate reads
② Malformed reads (i.e. mismatch in number of bases and base qualities)
③ Bad mate (paired-end sequencing, paired reads map to different chromosomes)
④ Mapping quality zero (maps to multiple locations equally well)
⑤ Too many mismatches (reads are kept only with fewer than 10% mismatches within 20 bp to either side of the position)

Page 66: Bioc4010 lectures 1 and 2

Remove Duplicate Reads

Application   Avg #Molecules/Library   Read Length   Avg #Molecules Sampled   Molecules Sampled > 1
30X Genome    5bn                      2x100         450m                     4.4%
4x Genome     5bn                      2x100         60m                      0.6%
100x Exome    500m                     2x75          20m                      2.0%

Duplicate reads break the assumption of independent sampling from the library

Identify reads with identical start/stop positions
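Finding reads with identical start/stop positions amounts to grouping by coordinates. The sketch below keeps one read per (chromosome, start, end, strand) group and flags the rest; including the strand in the key and keeping the read with the highest mean base quality are assumptions made here, and `AlignedRead` is a made-up record type rather than any tool's data structure.

```python
from typing import NamedTuple


class AlignedRead(NamedTuple):
    name: str
    chrom: str
    start: int
    end: int
    strand: str
    mean_quality: float


def find_duplicates(reads):
    """Return the names of reads flagged as duplicates."""
    best = {}  # position key -> read currently kept for that position
    duplicates = set()
    for read in reads:
        key = (read.chrom, read.start, read.end, read.strand)
        kept = best.get(key)
        if kept is None:
            best[key] = read
        elif read.mean_quality > kept.mean_quality:
            duplicates.add(kept.name)  # the earlier read loses its place
            best[key] = read
        else:
            duplicates.add(read.name)
    return duplicates


reads = [
    AlignedRead("r1", "chr1", 100, 200, "+", 30.0),
    AlignedRead("r2", "chr1", 100, 200, "+", 35.0),
    AlignedRead("r3", "chr1", 150, 250, "+", 30.0),
]
duplicates = find_duplicates(reads)
```

In this toy input r1 and r2 share coordinates, so the lower-quality r1 is flagged and r2 and r3 are kept.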

Page 67: Bioc4010 lectures 1 and 2

Sequencer-Specific Error Models

Rows: actual base. Columns: predicted base.

Actual   A      C      G      T
A        -      57.7   17.1   25.2
C        34.9   -      11.3   53.9
G        31.9   5.1    -      63.0
T        45.9   22.1   32.0   -

If a base was miscalled, what is it most likely to be called as instead?

Page 68: Bioc4010 lectures 1 and 2

Variant Calling

• SNP calls infested with false positives
  – Machine artifacts
  – Mis-mapped reads
  – Mis-aligned indels
• 5 – 20% false positive rate

Page 69: Bioc4010 lectures 1 and 2

Decisions and Trade-Offs

• Option 1: Use stringent program options for calling variants and hard filtering early to produce only highly-confident call set.

Page 70: Bioc4010 lectures 1 and 2

Decisions and Trade-Offs

• Option 1: Use stringent program options for calling variants and hard filtering early to produce only highly-confident call set.
  – Pro: Few false positives
  – Con: Will miss real variants

Page 71: Bioc4010 lectures 1 and 2

Decisions and Trade-Offs

• Option 1: Use stringent program options for calling variants and hard filtering early to produce only highly-confident call set.
  – Pro: Few false positives
  – Con: Will miss real variants
• Option 2: Use less stringent (but reasonable) options and filtering. Produce high-confidence call set. Progressive filtering at later stage

Page 72: Bioc4010 lectures 1 and 2

Decisions and Trade-Offs

• Option 1: Use stringent program options for calling variants and hard filtering early to produce only highly-confident call set.
  – Pro: Few false positives
  – Con: Will miss real variants
• Option 2: Use less stringent (but reasonable) options and filtering. Produce high-confidence call set. Progressive filtering at later stage
  – Pro: Won’t miss real variants
  – Con: Many more false positives

Page 73: Bioc4010 lectures 1 and 2

Decisions and Trade-Offs

• Option 1: Use stringent program options for calling variants and hard filtering early to produce only highly-confident call set.
  – Pro: Few false positives
  – Con: Will miss real variants
• Option 2: Use less stringent (but reasonable) options and filtering. Produce high-confidence call set. Progressive filtering at later stage
  – Con: False positives
  – Pro: Won’t miss real variants

Page 74: Bioc4010 lectures 1 and 2

How Good Are My Calls?

• How many called SNPs?
  – Human average of 1 heterozygous SNP / 1000 bases
• Fraction of variants already in dbSNP
• Transition/Transversion ratio
  – Transitions 2x as common
  – 2.8x when looking only at exons
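The transition/transversion check is easy to compute from a list of called SNVs. A transition swaps a purine for a purine (A/G) or a pyrimidine for a pyrimidine (C/T); everything else is a transversion. The sketch below assumes each call is a (reference base, alternative base) pair with two different bases; the function names are illustrative.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """True for A<->G or C<->T substitutions."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def ti_tv_ratio(snvs) -> float:
    """Ratio of transitions to transversions in a list of (ref, alt) calls."""
    transitions = sum(1 for ref, alt in snvs if is_transition(ref, alt))
    transversions = len(snvs) - transitions
    return transitions / transversions if transversions else float("inf")


calls = [("A", "G"), ("C", "T"), ("A", "C")]
ratio = ti_tv_ratio(calls)
```

A call set whose ratio falls well below the expected values quoted above is a hint that it contains many false positives, since random errors produce transversions about twice as often as transitions.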

Page 75: Bioc4010 lectures 1 and 2

ANNOTATING VARIANTS

Page 76: Bioc4010 lectures 1 and 2

Identifying Genetic Variation Causing Genetic Disease

Page 77: Bioc4010 lectures 1 and 2

Discovering Genetic Variants Causing Mendelian Disease

4 million genetic variants

2 million associated with protein-coding genes

10,000 possibly of disease-causing type

1500 <1% frequency in population

Page 78: Bioc4010 lectures 1 and 2

Discovering Genetic Variants Causing Mendelian Disease

4 million genetic variants

2 million associated with protein-coding genes

10,000 possibly of disease-causing type

1500 <1% frequency in population

Single Causal Genetic Variant

Page 79: Bioc4010 lectures 1 and 2

If a problem cannot be solved, enlarge it.

--Dwight D. Eisenhower

Page 80: Bioc4010 lectures 1 and 2

TYPES OF SINGLE NUCLEOTIDE VARIANTS

Page 81: Bioc4010 lectures 1 and 2

Disease Genomics: Hunting Down Pathogenic Genetic Variation

[Figure: reference gene model (Exon 1, Intron 1, Exon 2) with the start codon and the TAA stop codon marked]

Page 82: Bioc4010 lectures 1 and 2

Disease Genomics: Hunting Down Pathogenic Genetic Variation

[Figure: reference gene model (Exon 1, Intron 1, Exon 2) with the start codon, the TAA stop codon, the splice sites and the mRNA coding for protein marked]

Page 83: Bioc4010 lectures 1 and 2

Disease Genomics: Hunting Down Pathogenic Genetic Variation

[Figure: reference and patient gene models (Exon 1, Intron 1, Exon 2) with the start codon, the TAA stop codon, the splice sites and the mRNA coding for protein marked]

Page 84: Bioc4010 lectures 1 and 2

Disease Genomics: Hunting Down Pathogenic Genetic Variation

[Figure: reference and patient gene models (Exon 1, Intron 1, Exon 2) with the start codon, the TAA stop codon, the splice sites and the mRNA coding for protein marked. Annotated in the figure: a TAC (Tyr) codon]

Page 85: Bioc4010 lectures 1 and 2

Disease Genomics: Hunting Down Pathogenic Genetic Variation

[Figure: reference and patient gene models (Exon 1, Intron 1, Exon 2) with the start codon, the TAA stop codon, the splice sites and the mRNA coding for protein marked. Annotated in the figure: a TAC (Tyr) codon; Splice Site Loss]

Page 86: Bioc4010 lectures 1 and 2

Disease Genomics: Hunting Down Pathogenic Genetic Variation

[Figure: reference and patient gene models (Exon 1, Intron 1, Exon 2) with the start codon, the TAA stop codon, the splice sites and the mRNA coding for protein marked. Annotated in the figure: a TAC (Tyr) codon; Splice Site Loss; Missense]

Page 87: Bioc4010 lectures 1 and 2

Disease Genomics: Hunting Down Pathogenic Genetic Variation

[Figure: reference and patient gene models (Exon 1, Intron 1, Exon 2) with the start codon, the TAA stop codon, the splice sites and the mRNA coding for protein marked. Annotated in the figure: a TAC (Tyr) codon; Splice Site Loss; Missense/Frameshift; Stop Gain]

Page 88: Bioc4010 lectures 1 and 2

GENETIC REGIONS OF INTEREST

Page 89: Bioc4010 lectures 1 and 2

Identifying Genetic Regions of Interest

Page 90: Bioc4010 lectures 1 and 2

Number of Genes in Genomic Regions of Interest

Page 91: Bioc4010 lectures 1 and 2

FREQUENCY OF GENETIC VARIANTS

Page 92: Bioc4010 lectures 1 and 2

Frequency of Polymorphisms: Common vs Rare

• Mendelian disorders are caused by rare variation, < 1-2% frequency in the relevant population

• Leverage large projects aimed at assessing genetic diversity in populations around the world
  – 1000 Genomes
  – NHLBI Exome Sequencing Project

Page 93: Bioc4010 lectures 1 and 2

Human Populations

Page 94: Bioc4010 lectures 1 and 2

Population Matters

• Most variations in protein-coding genes occurred fairly recently (last 20,000 years)
  – Adaptation to agriculture and diet changes, pathogen exposure and urban living

Page 95: Bioc4010 lectures 1 and 2

Population Matters

• Most variations in protein-coding genes occurred fairly recently (last 20,000 years)
  – Adaptation to agriculture and diet changes, pathogen exposure and urban living
• Monogenic diseases have different prevalence in different populations
  – Cystic fibrosis in European population
  – Hereditary hemochromatosis in Northern Europeans
  – Tay-Sachs in Ashkenazi Jews
  – Sickle-cell anemia in Sub-Saharan African populations

Page 96: Bioc4010 lectures 1 and 2

Population Matters

• Most variations in protein-coding genes occurred fairly recently (last 20,000 years)
  – Adaptation to agriculture and diet changes, pathogen exposure and urban living
• Monogenic diseases have different prevalence in different populations
  – Cystic fibrosis in European population
  – Hereditary hemochromatosis in Northern Europeans
  – Tay-Sachs in Ashkenazi Jews
  – Sickle-cell anemia in Sub-Saharan African populations
• Polygenic disorders

Page 97: Bioc4010 lectures 1 and 2

1000 Genomes Project

Page 98: Bioc4010 lectures 1 and 2

Exome Sequencing Project

• Multi-Institutional
• Total possible patient pool of > 250,000 individuals, well phenotyped
  – Includes healthy individuals and diseased
• Currently 6700 exomes sequenced
  – 4420 European descent
  – 2312 African American
• 1.2 million coding variations
  – Most extremely rare/unique
  – Many population specific

Page 99: Bioc4010 lectures 1 and 2

IGNITE Project: Local Controls

• IGNITE: Tasked with studying rare monogenic diseases identified in Atlantic Canada

• Atlantic Canada harbours several non-represented population groups and sub-groups…

Page 100: Bioc4010 lectures 1 and 2

IGNITE Project: Local Controls

• IGNITE: Tasked with studying rare monogenic diseases identified in Atlantic Canada

• Atlantic Canada harbours several non-represented population groups and sub-groups…
  – Acadians
  – Native American
  – Non-Acadian/European Descent

Page 101: Bioc4010 lectures 1 and 2

Population Frequency

• Mendelian disorders are rare
• If variation is in database, is it associated with disease?
• Causal variation also needs to be rare
  – Cutoff somewhere in the < 0.5% to < 1.5% range
  – Should appear rarely or not at all in local controls
  – Track with disease in family members under study

Page 102: Bioc4010 lectures 1 and 2

Predicting the Impact of Missense Mutations

• Most use some level of evolutionary conservation to determine how severe a mutation is
  – SIFT
  – PolyPhen
  – GERP++
  – EvoD

Page 103: Bioc4010 lectures 1 and 2

Example: SIFT Algorithm

Input query sequence → PSI-BLAST → Homologs → Alignment → Multiple sequence alignment → PSSM → Normalize by most frequent AA → Score
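The last two steps of the flow (build position-specific scores, then normalize by the most frequent amino acid) can be illustrated on a toy alignment. This is a heavily simplified sketch and not the SIFT implementation: it uses raw column counts with no pseudocounts or sequence weighting, and the function names and the example alignment are invented for illustration.

```python
from collections import Counter


def normalized_column_scores(column):
    """Amino acid frequencies in one alignment column, scaled so that the
    most frequent amino acid scores 1.0. Gaps ('-') are ignored."""
    counts = Counter(aa for aa in column if aa != "-")
    if not counts:
        raise ValueError("column contains only gaps")
    top = max(counts.values())
    return {aa: count / top for aa, count in counts.items()}


def score_substitution(alignment, position, new_aa):
    """Score a substitution to new_aa at a 0-based alignment position.
    Amino acids never observed in the column score 0.0."""
    column = [sequence[position] for sequence in alignment]
    return normalized_column_scores(column).get(new_aa, 0.0)


# Hypothetical multiple sequence alignment of four homologs.
alignment = ["MKV", "MKI", "MRV", "MKV"]
seen_score = score_substitution(alignment, 1, "R")
unseen_score = score_substitution(alignment, 1, "D")
```

A low score means the substituted amino acid is rarely or never seen at that position among homologs, which is the signal these tools interpret as likely deleterious.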

Page 104: Bioc4010 lectures 1 and 2

Predicting Impact

• Other approaches include additional features:
  – Protein structure information
  – Site level annotation (active sites, binding sites, etc)
  – Protein domain information
  – Biophysical properties of amino acids in that position and of the substituted amino acid

Page 105: Bioc4010 lectures 1 and 2

Prediction Take-Away

The more conserved a site is the more likely any substitution is to be deleterious

However: Current methods have pretty poor performance, not suitable for clinical-level diagnosis

Page 106: Bioc4010 lectures 1 and 2

Classifying Genetic Variants

4 million variants

Intronic

Unknown Splice Site

Potential Disease Causing

Exonic

Amino Acid Changing

Known Genetic Disease Variant

Stop Loss / Stop Gain

Missense Mutation

Known Polymorphism in Population

Silent Mutation Splice Site

Potential Disease Causing

Intergenic

Page 107: Bioc4010 lectures 1 and 2

GENE LEVEL ANNOTATION

Page 108: Bioc4010 lectures 1 and 2

Annotating Genes and Variants

• Is variant in a known protein-coding gene?
  – What does the gene do?
  – What molecular pathways?
  – What protein-protein interactions?
  – What tissues is it expressed in?
  – When in development?

4 million genetic variants

2 million associated with protein-coding genes

10,000 possibly of disease-causing type

1500 <1% frequency in population

Page 109: Bioc4010 lectures 1 and 2

Gene Level Annotations

Page 110: Bioc4010 lectures 1 and 2

ADDING ANNOTATIONS TO VARIANTS

Page 111: Bioc4010 lectures 1 and 2

Genomic Intervals, Searching, and Annotation

• Most common way of describing genomic features is as an interval

• Multiple formats (BED, WIG, VCF, etc)
• In common for all is location:
  – Chromosome
  – Start Position of Feature
  – End Position of Feature
  – Annotations/Info (Optional)

Page 112: Bioc4010 lectures 1 and 2

Searching and Annotating: Interval Trees

• Interval Trees allow efficient searching of all overlapping intervals

• Easiest to make one tree per chromosome
• Given a set of intervals (n) on a number line (chromosome) construct a tree

Page 113: Bioc4010 lectures 1 and 2

Interval Trees

All intervals to left
All intervals to right

Node Contains:
- Centre point
- Intervals sorted by start
- Intervals sorted by end
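The node layout described here (a centre point, the overlapping intervals sorted by start and by end, and subtrees for everything to the left and right) can be sketched as a small class. This is an illustrative centred interval tree for point queries, assuming closed intervals given as (start, end, label) tuples and a non-empty input list; the class name, the choice of centre point and the example genes are invented.

```python
class IntervalNode:
    """One node of a centred interval tree (build one tree per chromosome)."""

    def __init__(self, intervals):
        # Centre point: the median of all interval endpoints.
        points = sorted(p for start, end, _ in intervals for p in (start, end))
        self.centre = points[len(points) // 2]
        here = [iv for iv in intervals if iv[0] <= self.centre <= iv[1]]
        left = [iv for iv in intervals if iv[1] < self.centre]
        right = [iv for iv in intervals if iv[0] > self.centre]
        self.by_start = sorted(here, key=lambda iv: iv[0])
        self.by_end = sorted(here, key=lambda iv: iv[1], reverse=True)
        self.left = IntervalNode(left) if left else None
        self.right = IntervalNode(right) if right else None

    def query_point(self, pos):
        """Return every interval that contains pos."""
        hits = []
        if pos <= self.centre:
            # Intervals stored here end at or after the centre, so only the start matters.
            for iv in self.by_start:
                if iv[0] > pos:
                    break
                hits.append(iv)
            if self.left is not None and pos < self.centre:
                hits.extend(self.left.query_point(pos))
        else:
            # Intervals stored here start at or before the centre, so only the end matters.
            for iv in self.by_end:
                if iv[1] < pos:
                    break
                hits.append(iv)
            if self.right is not None:
                hits.extend(self.right.query_point(pos))
        return hits


# Hypothetical gene intervals on one chromosome.
genes = [(100, 200, "geneA"), (150, 300, "geneB"), (400, 500, "geneC")]
tree = IntervalNode(genes)
overlapping = sorted(label for _, _, label in tree.query_point(175))
```

Keeping the node's intervals in both sort orders is what lets a query stop scanning as soon as the first non-overlapping interval is reached.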

Page 114: Bioc4010 lectures 1 and 2

CASE STUDIES

IGNITE: Brain Calcification, Charcot-Marie-Tooth and Cutis Laxa

Page 115: Bioc4010 lectures 1 and 2

IGNITE Data Pipeline and Integration

Mapped Region(s)

Known Genes

Gene Definitions

Pathway and Interactions

Annotated Genomic Variants

Filter

Sort

Prioritize

Gene Annotations

Page 116: Bioc4010 lectures 1 and 2

Brain Calcification

Page 117: Bioc4010 lectures 1 and 2

Brain Calcification

• 84 genes in chromosome 5 region
• No likely homozygous or compound heterozygous variants within region shared between two patients
• 29 genes with at least one targeted region with little or no sequencing coverage
• Many only lacked coverage in 5’ and 3’ UTRs
• Collaborators performed statistical tests for possible copy-number variations of targeted regions using exome sequencing data

Page 118: Bioc4010 lectures 1 and 2

Brain Calcification

Page 119: Bioc4010 lectures 1 and 2

Charcot-Marie-Tooth: Genetic Mapping

Chromosome 9:120,962,282 -133,033,431

Page 120: Bioc4010 lectures 1 and 2

Cutis Laxa: Genetic Mapping

Chromosome 17:79,596,811-81,041,077

Page 121: Bioc4010 lectures 1 and 2

Charcot-Marie-Tooth
• 143 genes in region
• 13 known causative genes
  – MPZ
  – PMP22
  – GDAP1
  – KIF1B
  – MFN2
  – SOX
  – EGR2
  – DNM2
  – RAB7
  – LITAF (SIMPLE)
  – GARS
  – YARS
  – LMNA

Cutis Laxa
• 52 genes in region
• 6 known causative genes
  – ATP6V0A2
  – ELN
  – FBLN5
  – EFEMP2
  – SCYL1BP1
  – ALDH18A1

Page 122: Bioc4010 lectures 1 and 2

Pathway and Interaction Data

• 37 pathways
  – Clathrin-derived vesicle budding
  – Lysosome vesicle biogenesis
  – Endocytosis
  – Golgi-associated vesicle biogenesis
  – Membrane trafficking
  – Trans-Golgi network vesicle budding
• Primarily LMNA or DNM2

• 10 pathways
  – Phagosome
  – Collecting duct acid secretion
  – Lysosome
  – Protein digestion and absorption
  – Metabolic pathways
  – Oxidative phosphorylation
  – Arginine and proline metabolism
• Primarily ATP6V0A2

Page 123: Bioc4010 lectures 1 and 2

Results: Charcot-Marie-Tooth

• 8 genes prioritized

Gene      Interactions   Pathway
LRSAM1    Multiple       Endocytosis
DNM1      DNM2           -
FNBP1     DNM2           -
TOR1A     LMNA           -
STXBP1    Multiple       Five
SH3GLB2   -              Endocytosis
PIP5KL1   -              Endocytosis
FAM125B   -              Endocytosis

• For more information
  – Guernsey et al (2010) PLoS Genetics. 6(8): e1001081

Page 124: Bioc4010 lectures 1 and 2

Results: Cutis Laxa

• 10 genes prioritized

Gene      Interactions   Pathway
HEXDC     Multiple       Phagosome
HG5       -              Phagosome
HG5       Multiple       Lysosome, Protein digestion
SIRT7     Multiple       Metabolic Pathways
FASN      -              Metabolic Pathways
DCXR      -              Metabolic Pathways
PYCR1     -              Metabolic Pathways, Arginine/Proline
PCYT2     -              Metabolic Pathways
ARHGDIA   -              Oxidative Phosphorylation

• For more information
  – Guernsey et al (2009) Am J Hum Genet. 85(1): 120-9