Data Preprocessing Preprocessing and SNP calling · Preprocessing and SNP calling Natasja S....

36626 - Next Generation Sequencing Analysis Data Preprocessing Next Generation Sequencing analysis DTU Bioinformatics


Page 1: Data Preprocessing Preprocessing and SNP calling · Preprocessing and SNP calling Natasja S. Ehlers, PhD student Center for Biological Sequence Analysis Functional Human Variation

36626 - Next Generation Sequencing Analysis

Preprocessing and SNP calling Natasja S. Ehlers, PhD student Center for Biological Sequence Analysis Functional Human Variation Group

Data Preprocessing

Next Generation Sequencing analysis

DTU Bioinformatics

Page 2:

Generalized NGS analysis

Question

Raw reads

Pre-processing

Assembly: Alignment / de novo

Application specific: Variant calling, count matrix, ...

Compare samples / methods

Answer?

(Diagram axis label: Data size)



Page 6:

Assembly: Two basic approaches

• Alignment (Monday): Use a reference genome and align your reads to it

• de novo assembly (Wednesday): Try to assemble the reads into a genome without any prior knowledge

But first, a look at data preprocessing

Page 7:

Preprocessing

• Reads have qualities - bases are not always correct!

• Different error profiles per technology

• What can we do?

• Quality trimming

• Adapter clipping

• 5’ clipping

• k-mer correction

• ...
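To make "reads have qualities" concrete, here is a minimal sketch of how a FASTQ quality string maps to Phred scores and error probabilities. It assumes the common Phred+33 ASCII encoding (older Illumina files may use a different offset); the function names are illustrative and not taken from the course material.

```python
def decode_qualities(quality_string, offset=33):
    """Convert a FASTQ quality string into a list of Phred scores.

    Assumption: Phred+33 encoding (offset=33).
    """
    return [ord(char) - offset for char in quality_string]


def error_probability(phred_score):
    """Probability that a base call is wrong: P = 10^(-Q/10)."""
    return 10 ** (-phred_score / 10)


if __name__ == "__main__":
    for q in decode_qualities("I5#"):
        print(q, error_probability(q))
```

A Phred score of 20 corresponds to a 1 in 100 chance that the base call is wrong, which is why 20 shows up as a trimming threshold on the following slides.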

Page 8:

Analyze data using FastQC

• Report basic statistics on your data

• Identify issues with your data

Page 9:

Per base sequence quality

Trim from the 3’ end until base quality is at least 20

Illumina
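The 3’ trimming rule can be sketched as follows. This is a simplified illustration that removes trailing bases while their quality is below the threshold; real trimmers often use windowed or running-sum criteria instead. Phred+33 encoding and the function name are assumptions.

```python
def trim_3prime(sequence, quality_string, min_quality=20, offset=33):
    """Remove trailing bases whose Phred quality is below min_quality.

    Assumption: Phred+33 encoded qualities. Trimming stops at the first
    base (scanning from the 3' end) that meets the threshold.
    """
    end = len(sequence)
    while end > 0 and ord(quality_string[end - 1]) - offset < min_quality:
        end -= 1
    return sequence[:end], quality_string[:end]


if __name__ == "__main__":
    # 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
    print(trim_3prime("ACGTAC", "IIII##"))
```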

Page 10:

Average quality

Remove reads with avg. qual < 20

Illumina

Remove reads with “N” basecalls
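The two read-level filters on this slide can be combined into one predicate. A minimal sketch, assuming Phred+33 qualities; the name `keep_read` is hypothetical.

```python
def keep_read(sequence, quality_string, min_mean_quality=20, offset=33):
    """Return True if a read passes both filters from the slide.

    A read is rejected if it contains an 'N' base call or if its mean
    Phred quality is below min_mean_quality (Phred+33 assumed).
    """
    if not quality_string or "N" in sequence.upper():
        return False
    scores = [ord(char) - offset for char in quality_string]
    return sum(scores) / len(scores) >= min_mean_quality


if __name__ == "__main__":
    print(keep_read("ACGT", "IIII"))  # high quality, no N
    print(keep_read("ACNT", "IIII"))  # contains N
```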

Page 11:

Trim from 5’

• Sometimes something is fishy at the beginning of the read

Clip a certain number of bases from the 5’ end

Page 12:

Adapters

• Sometimes adapters/primers are also part of the read

• Adapters/primers are non-biological sequences

• Short read alignment is global (the whole read must align) - adapters are a no-go

• de novo assembly will be confused ~ artificial repeats

• If you don't know which were used: FastQC will (may) find them for you!

Page 13:

Adapters - example

We will use “Cutadapt” and “AdapterRemoval” to cut adapters; many other options exist

Very important if your DNA fragment is shorter than the read length
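To show why a short fragment leads to adapter read-through, here is a deliberately simplified clipping sketch. It uses exact string matching only and is not how Cutadapt or AdapterRemoval work internally (those tools tolerate mismatches and use alignment). The adapter sequence in the example is illustrative, and `clip_adapter` and `min_overlap` are hypothetical names.

```python
def clip_adapter(read, adapter, min_overlap=3):
    """Remove a 3' adapter from a read using exact matching.

    First looks for the full adapter anywhere in the read; if absent,
    looks for a prefix of the adapter (at least min_overlap bases)
    at the very end of the read.
    """
    position = read.find(adapter)
    if position != -1:
        return read[:position]
    longest = min(len(adapter) - 1, len(read))
    for length in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:length]):
            return read[:-length]
    return read


if __name__ == "__main__":
    adapter = "AGATCGGAAGAGC"  # example adapter sequence for illustration
    print(clip_adapter("ACGTACGTAGATCGGAAGAGCTTT", adapter))
    print(clip_adapter("ACGTACGTAGATC", adapter))
```

When the fragment is shorter than the read length, the sequencer runs past the end of the fragment into the adapter, so the 3’ end of the read is adapter rather than sample DNA.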

Page 14:

454 / Ion Torrent data

• Main problem is indels at homopolymer runs

• (Trim homopolymers), trim trailing poor quality bases

• Remove very short reads

• For de novo assembly, adapters should be removed (PRINSEQ)

• For alignment we use Smith-Waterman (local), so adapters are less important

Prinseq output

Page 15:

k-mer correction

• What is a k-mer?

• Create a sliding window of size k, move it over all your reads and count occurrence of k-mers

• We can use this to correct sequencing errors!

DNA: ACGTGTAACGTGACGTTGGA

E.g. k=5: ACGTG, CGTGT, GTGTA
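The sliding-window counting described above can be sketched in a few lines. This is a minimal in-memory illustration using the DNA string from the slide; dedicated k-mer counters are needed for real data sets. The function name is an assumption.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer seen when sliding a window of size k over each read."""
    counts = Counter()
    for read in reads:
        for start in range(len(read) - k + 1):
            counts[read[start:start + k]] += 1
    return counts


if __name__ == "__main__":
    dna = "ACGTGTAACGTGACGTTGGA"
    counts = count_kmers([dna], k=5)
    for kmer, count in counts.most_common(3):
        print(kmer, count)
```

A read of length L contributes L - k + 1 k-mers, so this 20 nt sequence yields 16 5-mers in total.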

Page 16:

k-mer correction

such that the probability that a randomly selected k-mer from the space of $4^k/2$ (for odd k considering reverse complements as equivalent) possible k-mers occurs in a random sequence of nucleotides the size of the sequenced genome G is ~0.01. That is, we want k such that

$$\frac{2G}{4^k} \simeq 0.01 \qquad (2)$$

which simplifies to

$$k \simeq \log_4 200G \qquad (3)$$
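Equation (3) can be checked numerically. The sketch below returns the unrounded value of k for a given genome size; how to round it is left to the user, since the excerpt only mentions rounding down for the human genome. The function name is hypothetical.

```python
import math


def suggested_k(genome_size):
    """Unrounded k from equation (3): k = log4(200 * G)."""
    return math.log(200 * genome_size, 4)


if __name__ == "__main__":
    print(suggested_k(5_000_000))      # approximately 5 Mbp genome
    print(suggested_k(3_000_000_000))  # approximately 3 Gbp genome
```

For a 5 Mbp genome this gives a value just under 15, and for a 3 Gbp genome a value between 19 and 20, in line with the k = 15 and k = 19 quoted in the excerpt.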

For an approximately 5 Mbp such as E. coli, we set k to 15, and for the approximately 3 Gbp human genome, we set k to 19 (rounding down for computational reasons). For the human genome, counting all 19-mers in the reads is not a trivial task, requiring >100 GB of RAM to store the k-mers and counts, many of which are artifacts of sequencing errors. Instead of executing this computation on a single large memory machine, we harnessed the power of many small memory machines working in parallel on different batches of reads. We execute the analysis using Hadoop [43] to monitor the workflow, and also to sum together the partial counts computed on individual machines using an extension of the MapReduce word counting algorithm [45]. The Hadoop cluster used in these experiments contains 10 nodes, each with a dual core 3.2 gigahertz Intel Xeon processors, 4 GB of RAM, and 367 GB local disk (20 cores, 40 GB RAM, 3.6 TB local disk total).

In order to better differentiate true k-mers and error k-mers, we incorporate the quality values into k-mer counting. The number of appearances of low coverage true k-mers and high copy error k-mers may be similar, but we expect the error k-mers to have lower quality base calls. Rather than increment a k-mer’s coverage by one for every occurrence, we increment it by the product of the probabilities that the base calls in the k-mer are correct as defined by the quality values. We refer to this process as q-mer counting. q-mer counts approximate the expected coverage of a k-mer over the error distribution specified by the read’s quality values. By counting q-mers, we are able to better differentiate between true k-mers that were sequenced to low coverage and error k-mers that occurred multiple times due to bias or repetitive sequence.
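The q-mer idea in the excerpt can be illustrated with a small sketch: each k-mer occurrence adds the product of the per-base probabilities of being correct instead of adding one. This is an illustration of the principle only, not the implementation from Kelley et al.; Phred+33 encoding and the function name are assumptions.

```python
from collections import defaultdict


def qmer_counts(reads, k, offset=33):
    """Quality-weighted k-mer counts.

    reads: iterable of (sequence, quality_string) pairs, Phred+33 assumed.
    Each k-mer occurrence contributes the product of (1 - error probability)
    over its k bases.
    """
    counts = defaultdict(float)
    for sequence, quality_string in reads:
        p_correct = [1.0 - 10 ** (-(ord(c) - offset) / 10) for c in quality_string]
        for start in range(len(sequence) - k + 1):
            weight = 1.0
            for p in p_correct[start:start + k]:
                weight *= p
            counts[sequence[start:start + k]] += weight
    return dict(counts)


if __name__ == "__main__":
    # Same sequence, once with high qualities and once with two poor bases.
    print(qmer_counts([("ACGT", "IIII")], k=2))
    print(qmer_counts([("ACGT", "I##I")], k=2))
```

A k-mer covering low-quality bases contributes far less than one to its count, which is what separates high-copy error k-mers from low-coverage true k-mers.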

Coverage cutoff

A histogram of q-mer counts shows a mixture of two distributions - the coverage of true k-mers, and the coverage of error k-mers (see Figure 3). Inevitably, these distributions will mix and the cutoff at which true and error k-mers are differentiated must be chosen carefully [46]. By defining these two distributions, we can calculate the ratio of likelihoods that a k-mer at a given coverage came from one distribution or the other. Then the cutoff can be set to correspond to a likelihood ratio that suits the application of the sequencing. For instance, mistaking low coverage k-mers for errors will remove true sequence, fragmenting a de novo genome assembly and potentially creating mis-assemblies at repeats. To avoid this, we can set the cutoff to a point where the ratio of error k-mers to true k-mers is high, for example 1,000:1.

In theory, the true k-mer coverage distribution should be Poisson, but Illumina sequencing has biases that add variance [26]. Instead, we model true k-mer coverage as Gaussian to allow a free parameter for the variance. k-mers that occur multiple times in the genome due to repetitive sequence and duplications also complicate the distribution. We found that k-mer copy number in various genomes has a ‘heavy tail’ (meaning the tail of the distribution is not exponentially bounded) that is

[Figure: density of k-mer coverage. x-axis: Coverage (0 to 100); y-axis: Density (0.000 to 0.015); curves: True k-mers, Error k-mers]

Figure 3 k-mer coverage. 15-mer coverage model fit to 76× coverage of 36 bp reads from E. coli. Note that the expected coverage of a k-mer in the genome using reads of length L will be $\frac{L - k + 1}{L}$ times the expected coverage of a single nucleotide because the full k-mer must be covered by the read. Above, q-mer counts are binned at integers in the histogram. The error k-mer distribution rises outside the displayed region to 0.032 at coverage two and 0.691 at coverage one. The mixture parameter for the prior probability that a k-mer’s coverage is from the error distribution is 0.73. The mean and variance for true k-mers are 41 and 77 suggesting that a coverage bias exists as the variance is almost twice the theoretical 41 suggested by the Poisson distribution. The likelihood ratio of error to true k-mer is one at a coverage of seven, but we may choose a smaller cutoff for some applications.

Kelley et al. Genome Biology 2010, 11:R116 http://genomebiology.com/2010/11/11/R116


Stacked reads covering the same region:

ACGTGGTTGCCCTTAAA (one read; G at position 9)

ACGTGGTTACCCTTAAA (all remaining reads; A at position 9)

Kelley et al., 2010

Concept: Rare k-mers are sequencing errors

Need >15X coverage


Page 18:

Merge paired ends

Insert size: 500 nt, Reads: 100 nt, Middle: 300 nt

Insert size: 180 nt, Reads: 100 nt, Middle: -20 nt (overlap)

• Merge overlapping pairs: single longer read

• Smart because Illumina reads have bad 3’ quals

• Very useful for de novo assembly

Magoč and Salzberg, 2011
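The "Middle" numbers above follow from insert size minus twice the read length, and a negative value means the two reads overlap. The sketch below shows that arithmetic and a naive merge that requires an exact match in the overlap. It is not the algorithm of Magoč and Salzberg, which scores mismatches and uses qualities; all function names and `min_overlap` are assumptions.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(sequence):
    """Reverse complement of a DNA sequence (A, C, G, T, N only)."""
    return sequence.translate(_COMPLEMENT)[::-1]


def middle_size(insert_size, read_length):
    """Unsequenced middle of the fragment; negative means the reads overlap."""
    return insert_size - 2 * read_length


def merge_pair(read1, read2, min_overlap=4):
    """Merge a read pair into one longer read if the ends overlap exactly.

    read2 is given as sequenced (reverse strand), so it is reverse
    complemented first. Returns None when no exact overlap of at least
    min_overlap bases is found.
    """
    mate = reverse_complement(read2)
    for length in range(min(len(read1), len(mate)), min_overlap - 1, -1):
        if read1[-length:] == mate[:length]:
            return read1 + mate[length:]
    return None


if __name__ == "__main__":
    print(middle_size(500, 100), middle_size(180, 100))
    # 12 nt toy fragment ACGTACGGTTCA sequenced with 8 nt reads from both ends.
    print(merge_pair("ACGTACGG", "TGAACCGT"))
```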

Page 19:

Coverage

• Coverage/depth is how many times your data covers the genome (on average)

• Example:

• N: Number of reads: 5 million

• L: Read length: 100

• G: Genome size: 5 Mbases

• C = N*L/G = 5*100/5 = 100X

• On average there are 100 reads covering each position in the genome


Coverage

$C = \frac{N \times L}{G}$

G : genome size

N : number of reads

L : average read length

Example: 1,500,000,000 reads of 100 nt correspond to a human genome at 50x
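Both coverage examples on this slide can be reproduced with a one-line function. A minimal sketch; the function name is an assumption, and the human genome size of 3 Gbases is the approximate figure used elsewhere in this deck.

```python
def coverage(num_reads, read_length, genome_size):
    """Average depth of coverage: C = N * L / G."""
    return num_reads * read_length / genome_size


if __name__ == "__main__":
    print(coverage(5_000_000, 100, 5_000_000))          # 5 Mbase genome
    print(coverage(1_500_000_000, 100, 3_000_000_000))  # human genome
```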


Page 20:

Last, but important!

• Lots of data - storage is expensive!

• Keep data compressed whenever possible (gzip, bzip2, BAM)

• Remove intermediate files and files that can easily be re-created
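Keeping data compressed does not have to complicate your own scripts: gzip-compressed FASTQ can be read and written directly without creating an uncompressed copy. A minimal sketch using Python's standard `gzip` module, assuming standard four-line FASTQ records; the function names are illustrative.

```python
import gzip


def write_fastq_gz(path, records):
    """Write (name, sequence, quality) tuples as gzip-compressed FASTQ."""
    with gzip.open(path, "wt") as handle:
        for name, sequence, quality in records:
            handle.write(f"@{name}\n{sequence}\n+\n{quality}\n")


def count_fastq_reads(path):
    """Count reads in a gzip-compressed FASTQ file without decompressing to disk.

    Assumption: every record spans exactly four lines.
    """
    with gzip.open(path, "rt") as handle:
        return sum(1 for _ in handle) // 4
```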