24A - Mapping Short Sequencing Reads
-
8/6/2019 24A - Mapping Short Sequencing Reads
Mapping short sequencing reads
GENOME 373: Genomic Informatics
Prof. William Stafford Noble
-
One-minute responses
All these sequencing methods are for sequencing genomes, right? Or is this for replicating DNA to perform testing and analysis?
The methods we discussed are just for sequencing genomes (or metagenomes from microbial communities).
I'm sorry but I'm still a little confused by today's sequencing content. The method we use today is Sanger, right? For we can get several hundreds of sequences every time instead of 36bp or 35bp that are read by Solexa or SOLiD.
Yes, the method you use today is Sanger sequencing. The reads are a lot longer, but they are more expensive. Next generation technology works by producing lots of very short reads.
Today was a bit technical for a brain-dead Friday, but still interesting.
Today's lecture material was terrific; this material is why I enrolled. Some explanations were a little hand-wavy, but I'm excited to read more. Seeing some applications would be great.
Today was very interesting! It was also more straightforward.
My friend works in a nanopore sequencing lab in physics, seems really cool.
Your friend, or the sequencing?
-
One-minute responses
How can the t statistic be negative, when the denominator is an absolute value?
Sorry; the formula I gave you is for the two-tailed test. If you want to do one-tailed, then you have to remove the absolute value.
Does computing the t statistic itself tell us something, or is it necessary to compute the p-value?
The t statistic itself is not of much use without the p-value calculation.
Why is there a difference in the FDR calculation for peptides versus microarrays?
See the next three slides.
-
Estimating false discovery rate
[Figure: spectra are searched with SEQUEST against a target peptide database and a decoy peptide database, producing target and decoy peptide-spectrum matches (PSMs). The PSMs are sorted by XCorr, and the FDR is shown at three score thresholds.]
FDR = 0/5 = 0%
FDR = 1/7 = 14%
FDR = 2/10 = 20%
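The target-decoy estimate on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `decoy_fdr` and the XCorr scores are made up, and the sketch assumes the FDR is the number of decoy PSMs divided by the number of target PSMs at or above the threshold. The slide does not say which denominator it uses, so treat that choice as an assumption.

```python
def decoy_fdr(psms, threshold):
    """Estimate the FDR among PSMs scoring at or above threshold.

    psms is a list of (xcorr, is_decoy) pairs.
    Assumption: FDR = decoys / targets above the threshold.
    """
    targets = sum(1 for score, is_decoy in psms
                  if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms
                 if score >= threshold and is_decoy)
    if targets == 0:
        return 0.0  # nothing accepted, so nothing to be wrong about
    return decoys / targets


# Hypothetical scores, already sorted by XCorr (True marks a decoy PSM).
psms = [(5.0, False), (4.8, False), (4.6, False), (4.4, False),
        (4.2, False), (4.0, True), (3.8, False), (3.6, False),
        (3.4, True), (3.2, False), (3.0, False), (2.8, False)]

for threshold in (4.2, 3.6, 2.8):
    print(threshold, decoy_fdr(psms, threshold))
```

With these made-up scores the three thresholds give 0/5, 1/7 and 2/10, matching the pattern of values shown on the slide.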
-
False discovery rate
The false discovery rate (FDR) is the percentage of genes above a given position in the ranked list that are expected to be false positives.
In microarray analysis, FDR is the percentage of flagged genes that are not differentially expressed.
We can estimate the number of errors from the t-test p-values (details omitted).
5 FP
13 TP
33 TN
5 FN
FDR = FP / (FP + TP) = 5/18 = 27.8%
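The arithmetic above can be checked with a short Python sketch. The function name `false_discovery_rate` is an illustrative choice, not something from the lecture; the counts are the ones on the slide.

```python
def false_discovery_rate(fp, tp):
    """FDR = FP / (FP + TP): the fraction of flagged items that are false."""
    flagged = fp + tp
    if flagged == 0:
        return 0.0  # no items flagged
    return fp / flagged


# Counts from the slide: 5 false positives, 13 true positives.
fdr = false_discovery_rate(fp=5, tp=13)
print(f"FDR = {fdr:.1%}")  # FDR = 27.8%
```

Note that the true negatives and false negatives (33 TN, 5 FN) do not enter the calculation; only the flagged genes matter.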
-
What is the difference?
For PSMs, we use an explicit null model.
Color indicates whether the PSM is target or decoy.
For gene expression, we use an analytic null.
Color indicates whether the gene is actually differentially expressed or not.
In either case, the false discovery rate is the estimated percentage of items (genes/PSMs) above the threshold that are incorrect.
[Figure labels: PSMs sorted by XCorr, FDR = 2/10 = 20%; gene expression counts: 5 FP, 13 TP, 33 TN, 5 FN.]
-
Short read mapping
Input:
A reference genome
A collection of many 25-100bp tags
User-specified parameters
Output:
One or more genomic coordinates for each tag
In practice, only 70-75% of tags successfully map to the reference genome. Why?
-
Multiple mapping
A single tag may occur more than once in the reference genome.
The user may choose to ignore tags that appear more than n times.
As n gets large, you get more data, but also more noise in the data.
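The effect of the cutoff n can be seen in a small Python sketch. It is only an illustration: the names `find_all`, `map_tags` and `max_hits` are made up, the genome is treated as a single short string rather than several chromosomes, and only exact matches are considered.

```python
def find_all(genome, tag):
    """Return every 0-based start position where tag occurs exactly."""
    positions = []
    start = genome.find(tag)
    while start != -1:
        positions.append(start)
        # Advance by one so that overlapping occurrences are also found.
        start = genome.find(tag, start + 1)
    return positions


def map_tags(genome, tags, max_hits):
    """Map each tag, ignoring tags that appear more than max_hits times."""
    mapped = {}
    for tag in tags:
        hits = find_all(genome, tag)
        if 0 < len(hits) <= max_hits:
            mapped[tag] = hits
    return mapped


genome = "ACGTACGTTTGACA"          # toy genome, made up for the example
tags = ["ACGT", "TTGA", "GGGG"]

strict = map_tags(genome, tags, max_hits=1)  # unique tags only
loose = map_tags(genome, tags, max_hits=2)   # also keeps tags seen twice
print(strict)
print(loose)
```

With `max_hits=1` only the uniquely placed tag survives; raising it to 2 also keeps the tag that occurs twice, which is more data but with ambiguous placement.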
-
Inexact matching
An observed tag may not exactly match any position in the reference genome.
Sometimes, the tag almost matches one or more positions.
Such mismatches may represent a SNP or a bad read-out.
The user can specify the maximum number of mismatches, or a phred-style quality score threshold.
As the number of allowed mismatches goes up, the number of mapped tags increases, but so does the number of incorrectly mapped tags.
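A brute-force Python sketch shows how the mismatch limit changes the result. It is illustrative only: the names `hamming` and `map_with_mismatches` are made up, only substitutions are counted (no insertions or deletions), and phred-style quality scores are ignored.

```python
def hamming(a, b):
    """Count the positions at which two equal-length strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def map_with_mismatches(genome, tag, max_mismatches):
    """Return (position, mismatches) for each window within the limit."""
    hits = []
    for start in range(len(genome) - len(tag) + 1):
        window = genome[start:start + len(tag)]
        mismatches = hamming(window, tag)
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


genome = "ACGTACGTTTGACA"  # toy genome, made up for the example
tag = "ACGA"               # does not occur exactly in the genome

for limit in (0, 1, 2):
    print(limit, map_with_mismatches(genome, tag, limit))
```

The tag maps nowhere with zero mismatches, to two positions with one mismatch allowed, and to three positions with two allowed. Scanning every window like this is far too slow for a real genome, which is why indexed methods are used in practice.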
-
Short-read analysis software
-
Spaced seed alignment
Tags and tag-sized pieces of reference are cut into small seeds.
Pairs of spaced seeds are stored in an index.
Look up spaced seeds for each tag.
For each hit, confirm the remaining positions.
Report results to the user.
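The steps above can be sketched in Python as follows. This is one possible reading of the slide, not the implementation of any particular tool: the function names, the choice of four seeds per tag, and the toy genome are all assumptions. The idea is that if a tag is cut into four seeds and has at most two mismatches, at least two seeds must match exactly, so looking up every pair of seeds cannot miss it.

```python
from collections import defaultdict
from itertools import combinations


def seed_pair_keys(seq, n_seeds=4):
    """Cut seq into n_seeds equal seeds and return one key per seed pair."""
    size = len(seq) // n_seeds
    seeds = [seq[i * size:(i + 1) * size] for i in range(n_seeds)]
    return [(i, j, seeds[i], seeds[j])
            for i, j in combinations(range(n_seeds), 2)]


def build_index(genome, tag_len, n_seeds=4):
    """Index every tag-sized window of the genome by its seed pairs."""
    index = defaultdict(set)
    for start in range(len(genome) - tag_len + 1):
        for key in seed_pair_keys(genome[start:start + tag_len], n_seeds):
            index[key].add(start)
    return index


def lookup(genome, index, tag, max_mismatches=2, n_seeds=4):
    """Find candidates by seed pair, then confirm the remaining positions."""
    candidates = set()
    for key in seed_pair_keys(tag, n_seeds):
        candidates |= index.get(key, set())
    hits = []
    for start in sorted(candidates):
        window = genome[start:start + len(tag)]
        mismatches = sum(1 for x, y in zip(window, tag) if x != y)
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


genome = "TTACGTACGGATCCA"  # toy genome, made up for the example
index = build_index(genome, tag_len=8)
print(lookup(genome, index, "ACGTACGG"))  # exact copy of genome[2:10]
print(lookup(genome, index, "ACGTTCGG"))  # same tag with one substitution
```

The index is built once per tag length and reused for every tag, so the cost of each lookup depends on the number of candidate hits rather than on the genome size.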
-
Burrows-Wheeler
Store entire reference genome.
Align tag base by base from the end.
When tag is traversed, all active locations are reported.
If no match is found, then back up and try a substitution.
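A toy Python version of this search is sketched below. It is a simplified illustration, not how production aligners are written: the index is built by naively sorting all suffixes (fine for a short string, impractical for a genome), the function names are made up, and the search explores substitutions depth-first within a mismatch budget rather than strictly trying the exact match first.

```python
def build_fm_index(genome):
    """Build a toy index: suffix array, first-column offsets, BWT counts."""
    text = genome + "$"  # "$" marks the end and sorts before A, C, G, T
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # The BWT is the character that precedes each sorted suffix.
    bwt = "".join(text[i - 1] for i in suffix_array)
    alphabet = sorted(set(text))
    # first[c] = number of characters in text that sort before c.
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += text.count(c)
    # occ[c][i] = number of times c appears in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return suffix_array, first, occ


def search(tag, suffix_array, first, occ, max_mismatches=0):
    """Align tag base by base from its end, allowing substitutions."""
    hits = set()

    def extend(pos, lo, hi, budget):
        # Rows lo..hi-1 of the suffix array are the active locations.
        if pos < 0:  # whole tag traversed: report active locations
            hits.update(suffix_array[lo:hi])
            return
        for c in "ACGT":
            cost = 0 if c == tag[pos] else 1  # 1 means a substitution
            if c not in first or cost > budget:
                continue
            new_lo = first[c] + occ[c][lo]
            new_hi = first[c] + occ[c][hi]
            if new_lo < new_hi:  # still at least one active location
                extend(pos - 1, new_lo, new_hi, budget - cost)

    extend(len(tag) - 1, 0, len(suffix_array), max_mismatches)
    return sorted(hits)


index = build_fm_index("ACGTACGTTTGACA")  # toy genome, made up
print(search("ACGT", *index))
print(search("ACGA", *index, max_mismatches=1))
```

When a branch runs out of active locations the recursion returns and the next base is tried, which is the "back up and try a substitution" step in miniature.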
-
Comparison
Burrows-Wheeler
Requires
-
Spliced-read mapping
Used for processed mRNA data.
Reports reads that span introns.
Examples: TopHat, ERANGE
-
Remaining lectures
Short read mapping case studies
Phylogenetics (1-2 lectures)
UCSC Genome Browser
Practical computational biology
-
Problem #1
Modify the program find-unique-tags.py to report the location of each tag in the genome.
Use loops, rather than string methods.
> python map-tags.py genome.txt tags.txt locations.txt
Read 18917 bases in 4 chromosomes from genome.txt.
Read 1196 tags from tags.txt.
Mapped to 41122 locations.
-
Problem #2
Assume that you do not have enough memory to store the entire genome.
Modify the program map-tags.py to first read the tags into memory, and then scan the genome once.
The output should stay the same, but in a different order.
> python map-tags2.py genome.txt tags.txt locations.txt
Read 8372 bases in 1196 sequences from tags.txt.
Read 4 chromosomes from genome.txt.
Mapped to 41122 locations.