Introduction to second-generation...
Transcript of Introduction to second-generation...
![Page 1: Introduction to second-generation sequencingusers.umiacs.umd.edu/.../lect22_seqIntro/seqIntro.pdfCorrada Bravo 10/30/09 Second-Generation Sequencing • “Ultra high throughput”](https://reader034.fdocuments.in/reader034/viewer/2022050414/5f8aa6bb776cef3194554893/html5/thumbnails/1.jpg)
Introduction to second-generation
sequencingCMSC702 Spring 2014
Many slides courtesy of Ben Langmead (JHU)
![Page 2: Introduction to second-generation sequencingusers.umiacs.umd.edu/.../lect22_seqIntro/seqIntro.pdfCorrada Bravo 10/30/09 Second-Generation Sequencing • “Ultra high throughput”](https://reader034.fdocuments.in/reader034/viewer/2022050414/5f8aa6bb776cef3194554893/html5/thumbnails/2.jpg)
What Are We Measuring?
RNA-‐seq
![Page 3: Introduction to second-generation sequencingusers.umiacs.umd.edu/.../lect22_seqIntro/seqIntro.pdfCorrada Bravo 10/30/09 Second-Generation Sequencing • “Ultra high throughput”](https://reader034.fdocuments.in/reader034/viewer/2022050414/5f8aa6bb776cef3194554893/html5/thumbnails/3.jpg)
Transcrip7on
G T A A T C C T C
| | | | | | | | | C A T T A G G A G
DNA
G U A A U C C
RNA polymerase
mRNA
From DNA to mRNA
![Page 4: Introduction to second-generation sequencingusers.umiacs.umd.edu/.../lect22_seqIntro/seqIntro.pdfCorrada Bravo 10/30/09 Second-Generation Sequencing • “Ultra high throughput”](https://reader034.fdocuments.in/reader034/viewer/2022050414/5f8aa6bb776cef3194554893/html5/thumbnails/4.jpg)
Reverse transcrip7onClone cDNA strands, complementary to the mRNA
G U A A U C C U CReverse
transcriptase
mRNA
cDNA
C A T T A G G A G C A T T A G G A G C A T T A G G A G C A T T A G G A G
T T A G G A G
C A T T A G G A G C A T T A G G A G C A T T A G G A G C A T T A G G A G
C A T T A G G A G
![Page 5: Introduction to second-generation sequencingusers.umiacs.umd.edu/.../lect22_seqIntro/seqIntro.pdfCorrada Bravo 10/30/09 Second-Generation Sequencing • “Ultra high throughput”](https://reader034.fdocuments.in/reader034/viewer/2022050414/5f8aa6bb776cef3194554893/html5/thumbnails/5.jpg)
!
Corrada Bravo 10/30/09
Sec-gen Sequencing
!5
![Page 6: Introduction to second-generation sequencingusers.umiacs.umd.edu/.../lect22_seqIntro/seqIntro.pdfCorrada Bravo 10/30/09 Second-Generation Sequencing • “Ultra high throughput”](https://reader034.fdocuments.in/reader034/viewer/2022050414/5f8aa6bb776cef3194554893/html5/thumbnails/6.jpg)
!
Corrada Bravo 10/30/09
Sec-gen Sequencing
!6
Fragmentation is random, "i.e., not equal-sized (but hard to draw)
![Page 7: Introduction to second-generation sequencingusers.umiacs.umd.edu/.../lect22_seqIntro/seqIntro.pdfCorrada Bravo 10/30/09 Second-Generation Sequencing • “Ultra high throughput”](https://reader034.fdocuments.in/reader034/viewer/2022050414/5f8aa6bb776cef3194554893/html5/thumbnails/7.jpg)
!
Corrada Bravo 10/30/09
Sec-gen Sequencing
!7
![Page 8: Introduction to second-generation sequencingusers.umiacs.umd.edu/.../lect22_seqIntro/seqIntro.pdfCorrada Bravo 10/30/09 Second-Generation Sequencing • “Ultra high throughput”](https://reader034.fdocuments.in/reader034/viewer/2022050414/5f8aa6bb776cef3194554893/html5/thumbnails/8.jpg)
!
Corrada Bravo 10/30/09
Second-Generation Sequencing
• “Ultra high throughput” DNA sequencing"
• 3 gigabases / day vs."
• 3 gigabases / 13 years (human genome project, more or less)
!8
![Page 9: Introduction to second-generation sequencingusers.umiacs.umd.edu/.../lect22_seqIntro/seqIntro.pdfCorrada Bravo 10/30/09 Second-Generation Sequencing • “Ultra high throughput”](https://reader034.fdocuments.in/reader034/viewer/2022050414/5f8aa6bb776cef3194554893/html5/thumbnails/9.jpg)
Platforms
• Millions of short DNA fragments (~100 bp) sequenced in parallel
![Page 10: Introduction to second-generation sequencingusers.umiacs.umd.edu/.../lect22_seqIntro/seqIntro.pdfCorrada Bravo 10/30/09 Second-Generation Sequencing • “Ultra high throughput”](https://reader034.fdocuments.in/reader034/viewer/2022050414/5f8aa6bb776cef3194554893/html5/thumbnails/10.jpg)
Source: Metzker ML. Sequencing technologies - the next generation. Nat Rev Genet. 2010
Source: Whiteford et al. Swift: primary data analysis for the Illumina Solexa sequencing platform. Bioinformatics. 2009
Source: Metzker ML. Sequencing technologies - the next generation. Nat Rev Genet. 2010
namesequencequality scores
x 100s of millions
![Page 11: Introduction to second-generation sequencingusers.umiacs.umd.edu/.../lect22_seqIntro/seqIntro.pdfCorrada Bravo 10/30/09 Second-Generation Sequencing • “Ultra high throughput”](https://reader034.fdocuments.in/reader034/viewer/2022050414/5f8aa6bb776cef3194554893/html5/thumbnails/11.jpg)
Sequencing throughput
HiSeq 2000 25 billion bp per day
(2010)
GA IIx 5 billion bp per day
(2009)
GA II 1.6 billion bp per day
(2008)
Images: www.illumina.com/systems
Numbers: www.politigenomics.com/next-generation-sequencing-informatics
Dates: Illumina press releases
![Page 12: Introduction to second-generation sequencingusers.umiacs.umd.edu/.../lect22_seqIntro/seqIntro.pdfCorrada Bravo 10/30/09 Second-Generation Sequencing • “Ultra high throughput”](https://reader034.fdocuments.in/reader034/viewer/2022050414/5f8aa6bb776cef3194554893/html5/thumbnails/12.jpg)
Sequencing throughput
HiSeq 2500 60 billion bp per day
(2012)
GA IIx 5 billion bp per day
(2009)
GA II 1.6 billion bp per day
(2008)
Images: www.illumina.com/systems
Numbers: www.politigenomics.com/next-generation-sequencing-informatics
Dates: Illumina press releases
![Page 13: Introduction to second-generation sequencingusers.umiacs.umd.edu/.../lect22_seqIntro/seqIntro.pdfCorrada Bravo 10/30/09 Second-Generation Sequencing • “Ultra high throughput”](https://reader034.fdocuments.in/reader034/viewer/2022050414/5f8aa6bb776cef3194554893/html5/thumbnails/13.jpg)
Throughput growth gap
4-5x per year 2x per 2 years
>
![Page 14: Introduction to second-generation sequencingusers.umiacs.umd.edu/.../lect22_seqIntro/seqIntro.pdfCorrada Bravo 10/30/09 Second-Generation Sequencing • “Ultra high throughput”](https://reader034.fdocuments.in/reader034/viewer/2022050414/5f8aa6bb776cef3194554893/html5/thumbnails/14.jpg)
ionTorrent
![Page 15: Introduction to second-generation sequencingusers.umiacs.umd.edu/.../lect22_seqIntro/seqIntro.pdfCorrada Bravo 10/30/09 Second-Generation Sequencing • “Ultra high throughput”](https://reader034.fdocuments.in/reader034/viewer/2022050414/5f8aa6bb776cef3194554893/html5/thumbnails/15.jpg)
Oxford Nanopore
• Nanopore technology
• ultralong reads (48kb genome sequenced as one read)
![Page 16: Introduction to second-generation sequencingusers.umiacs.umd.edu/.../lect22_seqIntro/seqIntro.pdfCorrada Bravo 10/30/09 Second-Generation Sequencing • “Ultra high throughput”](https://reader034.fdocuments.in/reader034/viewer/2022050414/5f8aa6bb776cef3194554893/html5/thumbnails/16.jpg)
Source: Metzker ML. Sequencing technologies - the next generation. Nat Rev Genet. 2010
Source: Whiteford et al. Swift: primary data analysis for the Illumina Solexa sequencing platform. Bioinformatics. 2009
Source: Metzker ML. Sequencing technologies - the next generation. Nat Rev Genet. 2010
namesequencequality scores
x 100s of millions
(slide courtesy of Ben Langmead)
![Page 17: Introduction to second-generation sequencingusers.umiacs.umd.edu/.../lect22_seqIntro/seqIntro.pdfCorrada Bravo 10/30/09 Second-Generation Sequencing • “Ultra high throughput”](https://reader034.fdocuments.in/reader034/viewer/2022050414/5f8aa6bb776cef3194554893/html5/thumbnails/17.jpg)
From reads to evidence
![Page 18: Introduction to second-generation sequencingusers.umiacs.umd.edu/.../lect22_seqIntro/seqIntro.pdfCorrada Bravo 10/30/09 Second-Generation Sequencing • “Ultra high throughput”](https://reader034.fdocuments.in/reader034/viewer/2022050414/5f8aa6bb776cef3194554893/html5/thumbnails/18.jpg)
From reads to evidence2. Comparative
Sequence-wise, individuals of a species are nearly identical
Well curated, annotated “reference” genomes exist
D. melanogaster, Science, 2000 H. sapiens, Nature, 2000 M. musculus, Nature, 2002and Science, 2000
Idea: “Map” reads to their point of origin with respect to a reference, then study differences
![Page 19: Introduction to second-generation sequencingusers.umiacs.umd.edu/.../lect22_seqIntro/seqIntro.pdfCorrada Bravo 10/30/09 Second-Generation Sequencing • “Ultra high throughput”](https://reader034.fdocuments.in/reader034/viewer/2022050414/5f8aa6bb776cef3194554893/html5/thumbnails/19.jpg)
RNA-seq differential expression
GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTCATCCTATTATTTATCGCACCTACGTTCAATATT
GTCGCAGTATCTGTCT GTCGCAGTATCTGTCT GTCGCAGTATCTGTCT GTCGCAGTATCTGTCT GTCGCAGTATCTGTCT TGTCGCAGTATCTGTC TATGTCGCAGTATCTG TATATCGCAGTATCTG TATATCGCAGTATCTG TATATCGCAGTATCTG CCCTATATCGCAGTAT AGCACCCTATGTCGCA AGCACCCTATATCGCA AGCACCCTATGTCGCA GAGCACCCTATGTCGC CCGGAGCACCCTATAT CCGGAGCACCCTATAT GCCGGAGCACCCTATG
GTCGCAGTANCTGTCT ||||||||| |||||| GTCGCAGTATCTGTCT !GGATCTGCGATATACC |||||| ||||||||| GGATCT-CGATATACC !AATCTGATCTTATTTT |||||||||||||||| AATCTGATCTTATTTT !ATATATATATATATAT |||||||||||||||| ATATATATATATATAT !TCTCTCCCANNAGAGC ||||||||| ||||| TCTCTCCCAGGAGAGC
Align Aggregate
Statistics
Gene 1 differentially expressed?: YES !p-value: 0.0012
TGTCGCAGTATCTGTC AGCACCCTATGTCGCA GCCGGAGCACCCTATGGTCGCAGTANCTGTCT
||||||||| |||||| GTCGCAGTATCTGTCT !GGATCTGCGATATACC |||||| ||||||||| GGATCT-CGATATACC !AATCTGATCTTATTTT |||||||||||||||| AATCTGATCTTATTTT !ATATATATATATATAT |||||||||||||||| ATATATATATATATAT !TCTCTCCCANNAGAGC ||||||||| |||||
Align Aggregate
Gene 1
Sample A
Sample B
![Page 20: Introduction to second-generation sequencingusers.umiacs.umd.edu/.../lect22_seqIntro/seqIntro.pdfCorrada Bravo 10/30/09 Second-Generation Sequencing • “Ultra high throughput”](https://reader034.fdocuments.in/reader034/viewer/2022050414/5f8aa6bb776cef3194554893/html5/thumbnails/20.jpg)
Mapping
CTCAAACTCCTGACCTTTGGTGATCCACCCGCCTNGGCCTTC Take a read:
And a reference sequence:>MT dna:chromosome chromosome:GRCh37:MT:1:16569:1 GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTT CGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTC GCAGTATCTGTCTTTGATTCCTGCCTCATCCTATTATTTATCGCACCTACGTTCAATATT ACAGGCGAACATACTTACTAAAGTGTGTTAATTAATTAATGCTTGTAGGACATAATAATA ACAATTGAATGTCTGCACAGCCACTTTCCACACAGACATCATAACAAAAAATTTCCACCA AACCCCCCCTCCCCCGCTTCTGGCCACAGCACTTAAACACATCTCTGCCAAACCCCAAAA ACAAAGAACCCTAACACCAGCCTAACCAGATTTCAAATTTTATCTTTTGGCGGTATGCAC TTTTAACAGTCACCCCCCAACTAACACATTATTTTCCCCTCCCACTCCCATACTACTAAT CTCATCAATACAACCCCCGCCCATCCTACCCAGCACACACACACCGCTGCTAACCCCATA CCCCGAACCAACCAAACCCCAAAGACACCCCCCACAGTTTATGTAGCTTACCTCCTCAAA GCAATACACTGACCCGCTCAAACTCCTGGATTTTGGATCCACCCAGCGCCTTGGCCTAAA CTAGCCTTTCTATTAGCTCTTAGTAAGATTACACATGCAAGCATCCCCGTTCCAGTGAGT TCACCCTCTAAATCACCACGATCAAAAGGAACAAGCATCAAGCACGCAGCAATGCAGCTC AAAACGCTTAGCCTAGCCACACCCCCACGGGAAACAGCAGTGATTAACCTTTAGCAATAA ACGAAAGTTTAACTAAGCTATACTAACCCCAGGGTTGGTCAATTTCGTGCCAGCCACCGC GGTCACACGATTAACCCAAGTCAATAGAAGCCGGCGTAAAGAGTGTTTTAGATCACCCCC TCCCCAATAAAGCTAAAACTCACCTGAGTTGTAAAAAACTCCAGTTGACACAAAATAGAC TACGAAAGTGGCTTTAACATATCTGAACACACAATAGCTAAGACCCAAACTGGGATTAGA TACCCCACTATGCTTAGCCCTAAACCTCAACAGTTAAATCAACAAAACTGCTCGCCAGAA CACTACGAGCCACAGCTTAAAACTCAAAGGACCTGGCGGTGCTTCATATCCCTCTAGAGG AGCCTGTTCTGTAATCGATAAACCCCGATCAACCTCACCACCTCTTGCTCAGCCTATATA CCGCCATCTTCAGCAAACCCTGATGAAGGCTACAAAGTAAGCGCAAGTACCCACGTAAAG ACGTTAGGTCAAGGTGTAGCCCATGAGGTGGCAAGAAATGGGCTACATTTTCTACCCCAG AAAACTACGATAGCCCTTATGAAACTTAAGGGTCGAAGGTGGATTTAGCAGTAAACTAAG AGTAGAGTGCTTAGTTGAACAGGGCCCTGAAGCGCGTACACACCGCCCGTCACCCTCCTC AAGTATACTTCAAAGGACATTTAACTAAAACCCCTACGCATTTATATAGAGGAGACAAGT CGTAACCTCAAACTCCTGCCTTTGGTGATCCACCCGCCTTGGCCTACCTGCATAATGAAG AAGCACCCAACTTACACTTAGGAGATTTCAACTTAACTTGACCGCTCTGAGCTAAACCTA GCCCCAAACCCACTCCACCTTACTACCAGACAACCTTAGCCAAACCATTTACCCAAATAA AGTATAGGCGATAGAAATTGAAACCTGGCGCAATAGATATAGTACCGCAAGGGAAAGATG AAAAATTATAACCAAGCATAATATAGCAAGGACTAACCCCTATACCTTCTGCATAATGAA TTAACTAGAAATAACTTTGCAAGGAGAGCCAAAGCTAAGACCCCCGAAACCAGACGAGCT ACCTAAGAACAGCTAAAAGAGCACACCCGTCTATGTAGCAAAATAGTGGGAAGATTTATA
How do we determine the read’s point of origin with respect to the reference?
CTCAAAGACCTGACCTTTGGTGATCCACCC-----GCCTNGGCCTTC |||||| |||| |||| ||||||||| |||| ||||| CTCAAACTCCTGGATTTTG--GATCCACCCAGCTGGCCTTGGCCTAA
Hypothesis 1:
Hypothesis 2:
CTCAAACTCCTGACCTTTGGTGATCCACCCGCCTNGGCCTTC |||||||||||| ||||||||||||||||||||| ||||| | CTCAAACTCCTG-CCTTTGGTGATCCACCCGCCTTGGCCTAC
Answer: sequence similarity
Read
Reference
Read
Reference
Say hypothesis 2 is correct. Why are there still mismatches and gaps?
Which hypothesis is better?
![Page 21: Introduction to second-generation sequencingusers.umiacs.umd.edu/.../lect22_seqIntro/seqIntro.pdfCorrada Bravo 10/30/09 Second-Generation Sequencing • “Ultra high throughput”](https://reader034.fdocuments.in/reader034/viewer/2022050414/5f8aa6bb776cef3194554893/html5/thumbnails/21.jpg)
Mapping
CTCAAACTCCTGACCTTTGGTGATCCACCCGCCTNGGCCTTC
>MT dna:chromosome chromosome:GRCh37:MT:1:16569:1 GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTT CGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTC GCAGTATCTGTCTTTGATTCCTGCCTCATCCTATTATTTATCGCACCTACGTTCAATATT ACAGGCGAACATACTTACTAAAGTGTGTTAATTAATTAATGCTTGTAGGACATAATAATA ACAATTGAATGTCTGCACAGCCACTTTCCACACAGACATCATAACAAAAAATTTCCACCA AACCCCCCCTCCCCCGCTTCTGGCCACAGCACTTAAACACATCTCTGCCAAACCCCAAAA ACAAAGAACCCTAACACCAGCCTAACCAGATTTCAAATTTTATCTTTTGGCGGTATGCAC TTTTAACAGTCACCCCCCAACTAACACATTATTTTCCCCTCCCACTCCCATACTACTAAT CTCATCAATACAACCCCCGCCCATCCTACCCAGCACACACACACCGCTGCTAACCCCATA CCCCGAACCAACCAAACCCCAAAGACACCCCCCACAGTTTATGTAGCTTACCTCCTCAAA GCAATACACTGACCCGCTCAAACTCCTGGATTTTGTGATCCACCCAGCGCCTTGGCCTAA CTAGCCTTTCTATTAGCTCTTAGTAAGATTACACATGCAAGCATCCCCGTTCCAGTGAGT TCACCCTCTAAATCACCACGATCAAAAGGAACAAGCATCAAGCACGCAGCAATGCAGCTC AAAACGCTTAGCCTAGCCACACCCCCACGGGAAACAGCAGTGATTAACCTTTAGCAATAA ACGAAAGTTTAACTAAGCTATACTAACCCCAGGGTTGGTCAATTTCGTGCCAGCCACCGC GGTCACACGATTAACCCAAGTCAATAGAAGCCGGCGTAAAGAGTGTTTTAGATCACCCCC TCCCCAATAAAGCTAAAACTCACCTGAGTTGTAAAAAACTCCAGTTGACACAAAATAGAC TACGAAAGTGGCTTTAACATATCTGAACACACAATAGCTAAGACCCAAACTGGGATTAGA TACCCCACTATGCTTAGCCCTAAACCTCAACAGTTAAATCAACAAAACTGCTCGCCAGAA CACTACGAGCCACAGCTTAAAACTCAAAGGACCTGGCGGTGCTTCATATCCCTCTAGAGG AGCCTGTTCTGTAATCGATAAACCCCGATCAACCTCACCACCTCTTGCTCAGCCTATATA CCGCCATCTTCAGCAAACCCTGATGAAGGCTACAAAGTAAGCGCAAGTACCCACGTAAAG ACGTTAGGTCAAGGTGTAGCCCATGAGGTGGCAAGAAATGGGCTACATTTTCTACCCCAG AAAACTACGATAGCCCTTATGAAACTTAAGGGTCGAAGGTGGATTTAGCAGTAAACTAAG AGTAGAGTGCTTAGTTGAACAGGGCCCTGAAGCGCGTACACACCGCCCGTCACCCTCCTC AAGTATACTTCAAAGGACATTTAACTAAAACCCCTACGCATTTATATAGAGGAGACAAGT CGTAACCTCAAACTCCTGGCCTTTGGTGATCCACCCGCCTTGGCCTACCTGCATAATGAA AAGCACCCAACTTACACTTAGGAGATTTCAACTTAACTTGACCGCTCTGAGCTAAACCTA GCCCCAAACCCACTCCACCTTACTACCAGACAACCTTAGCCAAACCATTTACCCAAATAA AGTATAGGCGATAGAAATTGAAACCTGGCGCAATAGATATAGTACCGCAAGGGAAAGATG AAAAATTATAACCAAGCATAATATAGCAAGGACTAACCCCTATACCTTCTGCATAATGAA TTAACTAGAAATAACTTTGCAAGGAGAGCCAAAGCTAAGACCCCCGAAACCAGACGAGCT ACCTAAGAACAGCTAAAAGAGCACACCCGTCTATGTAGCAAAATAGTGGGAAGATTTATA GGTAGAGGCGACAAACCTACCGAGCCTGGTGATAGCTGGTTGTCCAAGATAGAATCTTAG TTCAACTTTAAATTTGCCCACAGAACCCTCTAAATCCCCTTGTAAATTTAACTGTTAGTC
This is an alignment: !!!!!!Software programs that compare reads to references and find alignments are aligners.
Read
Reference
Read
Reference
Alignment is computationally difficult because references (e.g. human) are very long (more than 1M times longer than what’s shown to the left) and sequencers produce data very rapidly, e.g. up to 25 billion bases per day in 2010.
Sequencing throughput increases by ~5x per year, whereas computers get faster at a rate closer to ~2x every 2 years.
CTCAAAGACCTGACCTTTGGTGATCCACCC-----GCCTNGGCCTTC |||||| |||| |||| ||||||||| |||| ||||| CTCAAACTCCTGGATTTTG--GATCCACCCAGCTGGCCTTGGCCTAA
![Page 22: Introduction to second-generation sequencingusers.umiacs.umd.edu/.../lect22_seqIntro/seqIntro.pdfCorrada Bravo 10/30/09 Second-Generation Sequencing • “Ultra high throughput”](https://reader034.fdocuments.in/reader034/viewer/2022050414/5f8aa6bb776cef3194554893/html5/thumbnails/22.jpg)
Mapping
CTCAAACTCCTGACCTTTGGTGATCCA Take a read:
And a reference sequence:>MT dna:chromosome chromosome:GRCh37:MT:1:16569:1 GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTT CGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTC GCAGTATCTGTCTTTGATTCCTGCCTCATCCTATTATTTATCGCACCTACGTTCAATATT ACAGGCGAACATACTTACTAAAGTGTGTTAATTAATTAATGCTTGTAGGACATAATAATA ACAATTGAATGTCTGCACAGCCACTTTCCACACAGACATCATAACAAAAAATTTCCACCA AACCCCCCCTCCCCCGCTTCTGGCCACAGCACTTAAACACATCTCTGCCAAACCCCAAAA ACAAAGAACCCTAACACCAGCCTAACCAGATTTCAAATTTTATCTTTTGGCGGTATGCAC TTTTAACAGTCACCCCCCAACTAACACATTATTTTCCCCTCCCACTCCCATACTACTAAT CTCATCAATACAACCCCCGCCCATCCTACCCAGCACACACACACCGCTGCTAACCCCATA CCCCGAACCAACCAAACCCCAAAGACACCCCCCACAGTTTATGTAGCTTACCTCCTCAAA GCAATACACTGACCCGCTCAAACTCCTGGATTTTGTGATCCACCCAGCGCCTTGGCCTAA CTAGCCTTTCTATTAGCTCTTAGTAAGATTACACATGCAAGCATCCCCGTTCCAGTGAGT TCACCCTCTAAATCACCACGATCAAAAGGAACAAGCATCAAGCACGCAGCAATGCAGCTC AAAACGCTTAGCCTAGCCACACCCCCACGGGAAACAGCAGTGATTAACCTTTAGCAATAA ACGAAAGTTTAACTAAGCTATACTAACCCCAGGGTTGGTCAATTTCGTGCCAGCCACCGC GGTCACACGATTAACCCAAGTCAATAGAAGCCGGCGTAAAGAGTGTTTTAGATCACCCCC TCCCCAATAAAGCTAAAACTCACCTGAGTTGTAAAAAACTCCAGTTGACACAAAATAGAC TACGAAAGTGGCTTTAACATATCTGAACACACAATAGCTAAGACCCAAACTGGGATTAGA TACCCCACTATGCTTAGCCCTAAACCTCAACAGTTAAATCAACAAAACTGCTCGCCAGAA CACTACGAGCCACAGCTTAAAACTCAAAGGACCTGGCGGTGCTTCATATCCCTCTAGAGG AGCCTGTTCTGTAATCGATAAACCCCGATCAACCTCACCACCTCTTGCTCAGCCTATATA CCGCCATCTTCAGCAAACCCTGATGAAGGCTACAAAGTAAGCGCAAGTACCCACGTAAAG ACGTTAGGTCAAGGTGTAGCCCATGAGGTGGCAAGAAATGGGCTACATTTTCTACCCCAG AAAACTACGATAGCCCTTATGAAACTTAAGGGTCGAAGGTGGATTTAGCAGTAAACTAAG AGTAGAGTGCTTAGTTGAACAGGGCCCTGAAGCGCGTACACACCGCCCGTCACCCTCCTC AAGTATACTTCAAAGGACATTTAACTAAAACCCCTACGCATTTATATAGAGGAGACAAGT CGTAACCTCAAACTCCTGGCCTTTGGTGATCCACCCGCCTTGGCCTACCTGCATAATGAA AAGCACCCAACTTACACTTAGGAGATTTCAACTTAACTTGACCGCTCTGAGCTAAACCTA GCCCCAAACCCACTCCACCTTACTACCAGACAACCTTAGCCAAACCATTTACCCAAATAA AGTATAGGCGATAGAAATTGAAACCTGGCGCAATAGATATAGTACCGCAAGGGAAAGATG AAAAATTATAACCAAGCATAATATAGCAAGGACTAACCCCTATACCTTCTGCATAATGAA TTAACTAGAAATAACTTTGCAAGGAGAGCCAAAGCTAAGACCCCCGAAACCAGACGAGCT ACCTAAGAACAGCTAAAAGAGCACACCCGTCTATGTAGCAAAATAGTGGGAAGATTTATA
CTCAAACTCCTGACCTTTGGTGATCCA |||||||||||| |||||||||||||| CTCAAACTCCTGCCCTTTGGTGATCCA
Hypothesis 1:
Hypothesis 2:
Read
Reference
Read
Reference
Is there any way to break the tie?
Which hypothesis is better?
CTCAAACTCCTGACCTTTGGTGATCCA |||||||||||||||||| |||||||| CTCAAACTCCTGACCTTTCGTGATCCA
![Page 23: Introduction to second-generation sequencingusers.umiacs.umd.edu/.../lect22_seqIntro/seqIntro.pdfCorrada Bravo 10/30/09 Second-Generation Sequencing • “Ultra high throughput”](https://reader034.fdocuments.in/reader034/viewer/2022050414/5f8aa6bb776cef3194554893/html5/thumbnails/23.jpg)
Mapping
Recall that reads come with per-cycle quality values (in red) !In FASTQ format (left), qualities are encoded as ASCII characters like B, = or %, but really they’re integers [0, 40]
A quality value Q is a function of the probability P that the sequencing machine called the wrong base:
Q = 10: 1 in 10 chance that base was miscalled Q = 20: 1 in 100 chance Q = 30: 1 in 1000 chance !Higher is “better.”
Qs are estimated by the sequencer’s software and aren’t necessarily accurate
Q = −10 · log10(P )
![Page 24: Introduction to second-generation sequencingusers.umiacs.umd.edu/.../lect22_seqIntro/seqIntro.pdfCorrada Bravo 10/30/09 Second-Generation Sequencing • “Ultra high throughput”](https://reader034.fdocuments.in/reader034/viewer/2022050414/5f8aa6bb776cef3194554893/html5/thumbnails/24.jpg)
Mapping
CTCAAACTCCTGACCTTTGGTGATCCA Take a read:
And a reference sequence:>MT dna:chromosome chromosome:GRCh37:MT:1:16569:1 GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTT CGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTC GCAGTATCTGTCTTTGATTCCTGCCTCATCCTATTATTTATCGCACCTACGTTCAATATT ACAGGCGAACATACTTACTAAAGTGTGTTAATTAATTAATGCTTGTAGGACATAATAATA ACAATTGAATGTCTGCACAGCCACTTTCCACACAGACATCATAACAAAAAATTTCCACCA AACCCCCCCTCCCCCGCTTCTGGCCACAGCACTTAAACACATCTCTGCCAAACCCCAAAA ACAAAGAACCCTAACACCAGCCTAACCAGATTTCAAATTTTATCTTTTGGCGGTATGCAC TTTTAACAGTCACCCCCCAACTAACACATTATTTTCCCCTCCCACTCCCATACTACTAAT CTCATCAATACAACCCCCGCCCATCCTACCCAGCACACACACACCGCTGCTAACCCCATA CCCCGAACCAACCAAACCCCAAAGACACCCCCCACAGTTTATGTAGCTTACCTCCTCAAA GCAATACACTGACCCGCTCAAACTCCTGGATTTTGTGATCCACCCAGCGCCTTGGCCTAA CTAGCCTTTCTATTAGCTCTTAGTAAGATTACACATGCAAGCATCCCCGTTCCAGTGAGT TCACCCTCTAAATCACCACGATCAAAAGGAACAAGCATCAAGCACGCAGCAATGCAGCTC AAAACGCTTAGCCTAGCCACACCCCCACGGGAAACAGCAGTGATTAACCTTTAGCAATAA ACGAAAGTTTAACTAAGCTATACTAACCCCAGGGTTGGTCAATTTCGTGCCAGCCACCGC GGTCACACGATTAACCCAAGTCAATAGAAGCCGGCGTAAAGAGTGTTTTAGATCACCCCC TCCCCAATAAAGCTAAAACTCACCTGAGTTGTAAAAAACTCCAGTTGACACAAAATAGAC TACGAAAGTGGCTTTAACATATCTGAACACACAATAGCTAAGACCCAAACTGGGATTAGA TACCCCACTATGCTTAGCCCTAAACCTCAACAGTTAAATCAACAAAACTGCTCGCCAGAA CACTACGAGCCACAGCTTAAAACTCAAAGGACCTGGCGGTGCTTCATATCCCTCTAGAGG AGCCTGTTCTGTAATCGATAAACCCCGATCAACCTCACCACCTCTTGCTCAGCCTATATA CCGCCATCTTCAGCAAACCCTGATGAAGGCTACAAAGTAAGCGCAAGTACCCACGTAAAG ACGTTAGGTCAAGGTGTAGCCCATGAGGTGGCAAGAAATGGGCTACATTTTCTACCCCAG AAAACTACGATAGCCCTTATGAAACTTAAGGGTCGAAGGTGGATTTAGCAGTAAACTAAG AGTAGAGTGCTTAGTTGAACAGGGCCCTGAAGCGCGTACACACCGCCCGTCACCCTCCTC AAGTATACTTCAAAGGACATTTAACTAAAACCCCTACGCATTTATATAGAGGAGACAAGT CGTAACCTCAAACTCCTGGCCTTTGGTGATCCACCCGCCTTGGCCTACCTGCATAATGAA AAGCACCCAACTTACACTTAGGAGATTTCAACTTAACTTGACCGCTCTGAGCTAAACCTA GCCCCAAACCCACTCCACCTTACTACCAGACAACCTTAGCCAAACCATTTACCCAAATAA AGTATAGGCGATAGAAATTGAAACCTGGCGCAATAGATATAGTACCGCAAGGGAAAGATG AAAAATTATAACCAAGCATAATATAGCAAGGACTAACCCCTATACCTTCTGCATAATGAA TTAACTAGAAATAACTTTGCAAGGAGAGCCAAAGCTAAGACCCCCGAAACCAGACGAGCT ACCTAAGAACAGCTAAAAGAGCACACCCGTCTATGTAGCAAAATAGTGGGAAGATTTATA
CTCAAACTCCTGACCTTTGGTGATCCA |||||||||||| |||||||||||||| CTCAAACTCCTGCCCTTTGGTGATCCA
Hypothesis 1:
Hypothesis 2:
Read
Reference
Read
Reference
Which hypothesis is better?
CTCAAACTCCTGACCTTTGGTGATCCA |||||||||||||||||| |||||||| CTCAAACTCCTGACCTTTCGTGATCCA
Q=30
Q=10
![Page 25: Introduction to second-generation sequencingusers.umiacs.umd.edu/.../lect22_seqIntro/seqIntro.pdfCorrada Bravo 10/30/09 Second-Generation Sequencing • “Ultra high throughput”](https://reader034.fdocuments.in/reader034/viewer/2022050414/5f8aa6bb776cef3194554893/html5/thumbnails/25.jpg)
Mapping
CTCAAACTCCTGACCTTTGGTGATCCA Take a read:
And a reference sequence:>MT dna:chromosome chromosome:GRCh37:MT:1:16569:1 GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTT CGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTC GCAGTATCTGTCTTTGATTCCTGCCTCATCCTATTATTTATCGCACCTACGTTCAATATT ACAGGCGAACATACTTACTAAAGTGTGTTAATTAATTAATGCTTGTAGGACATAATAATA ACAATTGAATGTCTGCACAGCCACTTTCCACACAGACATCATAACAAAAAATTTCCACCA AACCCCCCCTCCCCCGCTTCTGGCCACAGCACTTAAACACATCTCTGCCAAACCCCAAAA ACAAAGAACCCTAACACCAGCCTAACCAGATTTCAAATTTTATCTTTTGGCGGTATGCAC TTTTAACAGTCACCCCCCAACTAACACATTATTTTCCCCTCCCACTCCCATACTACTAAT CTCATCAATACAACCCCCGCCCATCCTACCCAGCACACACACACCGCTGCTAACCCCATA CCCCGAACCAACCAAACCCCAAAGACACCCCCCACAGTTTATGTAGCTTACCTCCTCAAA GCAATACACTGACCCGCTCAAACTCCTGGATTTTGTGATCCACCCAGCGCCTTGGCCTAA CTAGCCTTTCTATTAGCTCTTAGTAAGATTACACATGCAAGCATCCCCGTTCCAGTGAGT TCACCCTCTAAATCACCACGATCAAAAGGAACAAGCATCAAGCACGCAGCAATGCAGCTC AAAACGCTTAGCCTAGCCACACCCCCACGGGAAACAGCAGTGATTAACCTTTAGCAATAA ACGAAAGTTTAACTAAGCTATACTAACCCCAGGGTTGGTCAATTTCGTGCCAGCCACCGC GGTCACACGATTAACCCAAGTCAATAGAAGCCGGCGTAAAGAGTGTTTTAGATCACCCCC TCCCCAATAAAGCTAAAACTCACCTGAGTTGTAAAAAACTCCAGTTGACACAAAATAGAC TACGAAAGTGGCTTTAACATATCTGAACACACAATAGCTAAGACCCAAACTGGGATTAGA TACCCCACTATGCTTAGCCCTAAACCTCAACAGTTAAATCAACAAAACTGCTCGCCAGAA CACTACGAGCCACAGCTTAAAACTCAAAGGACCTGGCGGTGCTTCATATCCCTCTAGAGG AGCCTGTTCTGTAATCGATAAACCCCGATCAACCTCACCACCTCTTGCTCAGCCTATATA CCGCCATCTTCAGCAAACCCTGATGAAGGCTACAAAGTAAGCGCAAGTACCCACGTAAAG ACGTTAGGTCAAGGTGTAGCCCATGAGGTGGCAAGAAATGGGCTACATTTTCTACCCCAG AAAACTACGATAGCCCTTATGAAACTTAAGGGTCGAAGGTGGATTTAGCAGTAAACTAAG AGTAGAGTGCTTAGTTGAACAGGGCCCTGAAGCGCGTACACACCGCCCGTCACCCTCCTC AAGTATACTTCAAAGGACATTTAACTAAAACCCCTACGCATTTATATAGAGGAGACAAGT CGTAACCTCAAACTCCTGGCCTTTGGTGATCCACCCGCCTTGGCCTACCTGCATAATGAA AAGCACCCAACTTACACTTAGGAGATTTCAACTTAACTTGACCGCTCTGAGCTAAACCTA GCCCCAAACCCACTCCACCTTACTACCAGACAACCTTAGCCAAACCATTTACCCAAATAA AGTATAGGCGATAGAAATTGAAACCTGGCGCAATAGATATAGTACCGCAAGGGAAAGATG AAAAATTATAACCAAGCATAATATAGCAAGGACTAACCCCTATACCTTCTGCATAATGAA TTAACTAGAAATAACTTTGCAAGGAGAGCCAAAGCTAAGACCCCCGAAACCAGACGAGCT ACCTAAGAACAGCTAAAAGAGCACACCCGTCTATGTAGCAAAATAGTGGGAAGATTTATA
CTCAAACTCCTGACCTTTGGTGATCCA |||||||||||| |||||||||||||| CTCAAACTCCTG-CCTTTGGTGATCCA
Hypothesis 1:
Hypothesis 2:
Read
Reference
Read
Reference
Is there any way to break the tie? !Hint: In Illumina sequencing, sequencing errors almost never manifest as gaps
Which hypothesis is better?
CTCAAACTCCTGACCTTTGGTGATCCA |||||||||||||||||| |||||||| CTCAAACTCCTGACCTTTCGTGATCCA
![Page 26: Introduction to second-generation sequencingusers.umiacs.umd.edu/.../lect22_seqIntro/seqIntro.pdfCorrada Bravo 10/30/09 Second-Generation Sequencing • “Ultra high throughput”](https://reader034.fdocuments.in/reader034/viewer/2022050414/5f8aa6bb776cef3194554893/html5/thumbnails/26.jpg)
Mapping
CTCAAACTCCTGACCTTTGGTGATCCA |||||||||||| |||||||||||||| CTCAAACTCCTG-CCTTTGGTGATCCA
Read
Reference
Aligners can employ penalties to account for the relative probability of seeing different dissimilarities
Estimates vary, but small gaps (“indels”) occur in humans at 1 in ~10-100K positions.
SNPs occur in humans at 1 in ~1K positions, but depending on Q, sequencing error may be more likely
Penalty = 45
CTCAAACTCCTGACCTTTGGTGATCCA ||||| |||||||||||||| |||||| CTCAA-CTCCTGACCTTTGGCGATCCA
Read
Reference
Penalty = 55
Q=10
CTCAAACTCCTGACCTTTGGTGATCCA |||||||||||||||||||||| |||| CTCAAACTCCTGACCTTTGGTGCTCCA
Read
Reference
Penalty = 30
Q=40
Pengap ≡ −10 log10(Pgap)
= −10 log10(0.00005)
≈ 45
Penmm ≡ argmin(−10 log10(Pmiscall),−10 log10(PSNP))
= argmin(Q,−10 log10(0.001))
= argmin(Q, 30)
![Page 27: Introduction to second-generation sequencingusers.umiacs.umd.edu/.../lect22_seqIntro/seqIntro.pdfCorrada Bravo 10/30/09 Second-Generation Sequencing • “Ultra high throughput”](https://reader034.fdocuments.in/reader034/viewer/2022050414/5f8aa6bb776cef3194554893/html5/thumbnails/27.jpg)
State of the Art• Bowtie: ultra-fast mapping of short reads to
reference genome
!
!
!
!
• http://bowtie-bio.sourceforge.net