Vanderbilt Center for Quantitative Sciences Summer Institute Sequencing Analysis (DNA) Yan Guo.
Alignment
ATCGGGAATGCCGTTAACGGTTGGCGT
Reference genome
The human genome is about 3 billion base pairs (3,000,000,000) in length. If a read is 100 bp long, what is the probability that it matches a given position purely by chance?
1/(4 x 4 x 4 … x 4) = 1/4^100 ≈ 1/(1.60694 x 10^60)
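The arithmetic above can be checked with a few lines of Python. This is only a sketch: it assumes every base is independent and equally likely (a simplification, since real genomes contain repeats), and the variable names are illustrative.

```python
# Assumption: each of the 4 bases is equally likely and independent.
READ_LENGTH = 100
GENOME_SIZE = 3_000_000_000  # approximate human genome length in bp

# Number of distinct sequences of length READ_LENGTH.
sequences = 4 ** READ_LENGTH

# Probability that a random read matches one given genome position.
p_match = 1 / sequences

# Expected number of chance matches across all start positions.
expected_chance_hits = GENOME_SIZE * p_match

print(f"4^{READ_LENGTH} = {sequences:.5e}")
print(f"expected chance hits in the genome = {expected_chance_hits:.3e}")
```

Under this idealized model a 100 bp read essentially never matches a second location by chance; ambiguous alignments in practice come from repeated sequence, not from random coincidence.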
Alignment Tools
• BWA http://bio-bwa.sourceforge.net/
• Bowtie http://bowtie-bio.sourceforge.net/index.shtml
Accurately aligning 30 million reads by brute force would take about 30 million x 3 billion time units.
Both tools are based on the Burrows-Wheeler transform.
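To make the two statements above concrete, here is a minimal sketch of the transform itself, built from sorted rotations. It is a teaching version only, not how BWA or Bowtie build their indexes (they use far more memory-efficient constructions); the function name and the "$" sentinel are illustrative choices.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Naive Burrows-Wheeler transform via sorted rotations.

    Quadratic memory; fine for short strings, not for genomes.
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    # All cyclic rotations of the text, sorted lexicographically.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)


# Cost of the brute-force comparison mentioned on the slide.
brute_force_units = 30_000_000 * 3_000_000_000

print(bwt("ATCGGGAATGCCGTTAACGGTTGGCGT"))
print(f"brute-force time units: {brute_force_units:.1e}")
```

The transform groups identical characters that share the same following context, which is what lets an index built on it find exact matches without scanning the whole reference for every read.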
Alignment Results – BAM files
• SAM – uncompressed
• BAM – compressed
• http://samtools.github.io/hts-specs/SAMv1.pdf
• Sort and index before performing analysis
• Don’t forget to perform QC on alignment
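The sort, index, and QC steps listed above could be scripted roughly as follows. This sketch assumes samtools is installed and on the PATH; the file names are hypothetical, and `samtools flagstat` is just one basic summary, not a complete QC procedure.

```python
import subprocess


def sort_and_index_commands(in_bam: str, sorted_bam: str) -> list[list[str]]:
    """Build the samtools commands to sort, index, and summarize a BAM."""
    return [
        ["samtools", "sort", "-o", sorted_bam, in_bam],  # coordinate sort
        ["samtools", "index", sorted_bam],               # writes the index
        ["samtools", "flagstat", sorted_bam],            # basic alignment stats
    ]


def sort_and_index(in_bam: str, sorted_bam: str) -> None:
    """Run each command in order, stopping at the first failure."""
    for command in sort_and_index_commands(in_bam, sorted_bam):
        subprocess.run(command, check=True)


# Hypothetical file names, shown without running anything.
for command in sort_and_index_commands("sample.bam", "sample.sorted.bam"):
    print(" ".join(command))
```

Keeping command construction separate from execution makes the steps easy to inspect or log before anything touches the data.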
SNP calling
• GATK https://www.broadinstitute.org/gatk/
• Varscan http://varscan.sourceforge.net/
Somatic Mutation
• Different from SNP (not germline)
• Both tumor and normal samples are needed to accurately define a somatic mutation
• Tumor sample is almost never 100% tumor
Somatic mutation callers
• MuTect http://www.broadinstitute.org/cancer/cga/mutect
• Varscan http://varscan.sourceforge.net/
Quality Control on SNPs
• Number of novel non-synonymous SNPs: ~100–200
• Transition / transversion ratio
• Heterozygous / non-reference homozygous ratio
• Heterozygous consistency
• Strand bias
• Cycle bias
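Two of the ratios in the list above are simple counts, sketched below. The input formats are assumptions made for illustration: SNPs as (ref, alt) base pairs and genotypes as VCF-style strings ("0/1" heterozygous, "1/1" non-reference homozygous).

```python
# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ti_tv_ratio(snps):
    """Transition / transversion ratio for a list of (ref, alt) pairs."""
    transitions = sum(1 for ref, alt in snps if (ref, alt) in TRANSITIONS)
    transversions = len(snps) - transitions
    return transitions / transversions if transversions else float("inf")


def het_hom_ratio(genotypes):
    """Heterozygous / non-reference homozygous ratio for one sample."""
    het = sum(1 for g in genotypes if g == "0/1")
    hom_alt = sum(1 for g in genotypes if g == "1/1")
    return het / hom_alt if hom_alt else float("inf")


# Toy data, made up for illustration only.
example_snps = [("A", "G"), ("C", "T"), ("A", "C"), ("G", "A")]
example_genotypes = ["0/1", "0/1", "1/1", "0/0"]
print(ti_tv_ratio(example_snps))
print(het_hom_ratio(example_genotypes))
```

Comparing these ratios across samples processed the same way is a quick method for spotting an outlier call set.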
Strand Bias
Table 1. Strand bias examples from real data
Chr  Pos       Depth  a   b   c   d  Forward strand genotype  Reverse strand genotype
6    32975014  21     5   5   10  1  Heterozygous             Homozygous
1    81967962  38     20  11  7   0  Heterozygous             Homozygous
12   10215654  31     15  9   7   0  Heterozygous             Homozygous
a. Forward strand reference allele
b. Forward strand non-reference allele
c. Reverse strand reference allele
d. Reverse strand non-reference allele
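One common way to put a number on strand bias is Fisher's exact test on the four allele counts in Table 1 (forward reference, forward non-reference, reverse reference, reverse non-reference). The slides do not say which statistic was used, so the sketch below is an assumption for illustration; it uses only the standard library.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1, n = a + c, a + b + c + d

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as observed.
    return sum(
        prob(x) for x in range(low, high + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )


# Rows of Table 1: (chr, pos, a, b, c, d).
table_1 = [
    ("6", 32975014, 5, 5, 10, 1),
    ("1", 81967962, 20, 11, 7, 0),
    ("12", 10215654, 15, 9, 7, 0),
]
for chrom, pos, a, b, c, d in table_1:
    p_value = fisher_exact_two_sided(a, b, c, d)
    print(f"chr{chrom}:{pos} strand-bias p = {p_value:.4f}")
```

A small p-value means the non-reference allele is distributed unevenly between strands, which is the pattern behind the conflicting forward and reverse genotypes in the table.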
Pooled Analysis
• Pool samples together without barcode
• Save money
• Can only be used to evaluate allele frequency
Known – Things we have always known that sequencing data can do
SNV, mutation
CNV
Xie et al. BMC Bioinformatics 2009
Structural Variants
Alkan et al. Nature Reviews Genetics, 2011
Known Unknown – Other information we found that sequencing data contain
[Figure labels: Known, Known Unknown, Unknown Unknown]
SNVs and mutations in non-targeted regions
Mitochondria
Virus and microbe
How is additional data mining possible?
• Data mining is possible because capture techniques are not perfect.
Potential Functions of Intron and Intergenic
ENCODE suggested that over 80% of the human genome may be functional.
The majority of GWAS SNPs are not in coding regions (706 exon, 3986 intron, 3323 intergenic).
Coverage of the Unintended Regions
• Coverage does not drop off suddenly where the capture region ends.
• Capture region example: chr1 1000 1500
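To illustrate the capture region example, the sketch below labels an aligned position as on target, flanking, or off target relative to chr1 1000 1500. The 200 bp flank width and the function name are hypothetical choices for illustration, not values from the slides.

```python
def classify_position(pos: int, start: int = 1000, end: int = 1500,
                      flank: int = 200) -> str:
    """Label a position relative to a capture region [start, end].

    `flank` is a hypothetical window size around the region.
    """
    if start <= pos <= end:
        return "on_target"
    if start - flank <= pos < start or end < pos <= end + flank:
        return "flanking"
    return "off_target"


# Made-up read positions on chr1.
for position in (900, 1200, 1600, 5000):
    print(position, classify_position(position))
```

Counting reads per label across all capture regions gives a rough picture of how much usable data falls outside the intended targets.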
Reads Aligned to Non Target Regions Can Be Used to Detect SNPs
• Tibetan exome study: Through exome sequencing of 50 Tibetan subjects, 2 intron SNPs were identified as associated with high altitude. (Yi, et al. Science, 2010)
• Non-capture region study: Reads from non-capture regions were studied to show that they can be used to infer reliable SNPs. (Guo, et al. BMC Genomics)
Known unknown - Mitochondria
However, the mitochondrial genome is only 16,569 bp.
Assumptions: 40 million reads, each 100 bp long
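A back-of-the-envelope calculation under the assumptions above shows how little mitochondrial coverage would be expected if reads were spread evenly over a 3 billion bp genome. The even-spread model is an assumption made purely for illustration; it ignores capture enrichment and the many mitochondrial genome copies per cell.

```python
# Values taken from the slides.
TOTAL_READS = 40_000_000
READ_LENGTH = 100            # bp
MITO_LENGTH = 16_569         # bp
GENOME_LENGTH = 3_000_000_000  # bp, approximate

# Assumption: reads land uniformly at random across the genome.
expected_mito_reads = TOTAL_READS * MITO_LENGTH / GENOME_LENGTH

# Average depth = total aligned bases / target length.
expected_mito_depth = expected_mito_reads * READ_LENGTH / MITO_LENGTH

print(f"expected mitochondrial reads: {expected_mito_reads:.2f}")
print(f"expected mitochondrial depth: {expected_mito_depth:.2f}x")
```

Observed mitochondrial depth in real data can differ greatly from this uniform estimate, which is why it is worth measuring directly before relying on it.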
Extract mitochondria from exome sequencing
Tools:
• Picardi et al. Nature Methods, 2012
• Guo et al. Bioinformatics, 2013 (MitoSeek)
Diagnosis:
• Dinwiddie et al. Genomics, 2013
• Nemeth et al. Brain, 2013
Virus
• Virus sequences can be captured through high throughput sequencing of human samples
• HBV in liver cancer samples (Sung, et al. Nature Genetics, 2012) (Jiang, et al. Genome Research, 2012)
• HPV in head and neck cancer (Chen, et al. Bioinformatics, 2012)
Tools for Detecting Virus from Sequencing data
• PathSeq (Kostic, et al. Nature Biotechnology, 2011)
• VirusSeq (Chen, et al. Bioinformatics, 2012)
• ViralFusionSeq (Li, et al. Bioinformatics, 2012)
• VirusFinder (Wang, et al. PLoS ONE, 2013)
The Data Mining Ideas applied to RNA
• RNAseq has been used as a replacement for microarrays.
• Other applications of RNAseq include detection of alternative splicing and fusion genes.
• Additional data mining opportunities are also available for RNAseq data.
SNV and Indel
• Difficulty due to high false positive rate
• RNAMapper (Miller, et al. Genome Research, 2013)
• SNVQ (Duitama, et al. BMC Genomics, 2013)
• FX (Hong, et al. Bioinformatics, 2012)
• OSA (Hu, et al. Bioinformatics, 2012)
Microsatellite instability
Examples:
• Yoon, et al. Genome Research, 2013
• Zheng, et al. BMC Genomics, 2013
RNA Editing and Allele-specific expression
RNA editing tools and databases
• DARNED, REDidb, dbRES, RADAR
Allele-specific expression
• asSeq (Sun, et al. Biometrics, 2012)
• AlleleSeq (Rozowsky, et al. Molecular Systems Biology, 2011)