Reference genome based sequence variation detection
Computational Biology Service Unit (CBSU)
Cornell Center for Comparative and Population Genomics (3CPG)
Center for Vertebrate Genomics (CVG)
CBSU/3CPG/CVG Joint Workshop Series
Assembly Alignment
Two different data analysis strategies
De novo Assembly
ACGGTACCTAAACCGGTACCTAAACCGGA
ACGAGCAACACGGTACCTA
TACCTAAACCGGACCCGGAAAGAC
ACGGTAGCTAAACCGGTAGCTAAACCGGA
ACGAGCAACACGGTAGCTA
TAGCTAAACCGGACCCGGAAAGAC
......ACGAGCAACACGGTACCTAAACCGGACCCGGAAAGAC.....
......ACGAGCAACACGGTAGCTAAACCGGACCCGGAAAGAC.....
ReferenceAlignment
ACGGTACCTAAACCGGTACCTAAACCGGA
ACGAGCAACACGGTACCTA
TACCTAAACCGGACCCGGAAAGAC
ACGGTAGCTAAACCGG
TAGCTAAACCGGA
ACGAGCAACACGGTAGCTA
TAGCTAAACCGGACCCGGAAAGAC
ACGGTACCTAAACCGGTACCTAAACCGGA
ACGAGCAACACGGTACCTA
TACCTAAACCGGACCCGGAAAGAC
ACGGTAGCTAAACCGGTAGCTAAACCGGA
ACGAGCAACACGGTAGCTA
TAGCTAAACCGGACCCGGAAAGAC
Reference GenomeC
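The alignment strategy can be illustrated with a toy sketch: place each read on the reference at the ungapped position with the fewest mismatches, pile up the read bases, and report positions where the reads disagree with the reference. The reference and reads are taken from the slides above; the function names and the `max_mismatches` cutoff are assumptions for illustration, and real aligners such as BWA use an index and handle gaps rather than scanning every offset.

```python
from collections import Counter, defaultdict

REFERENCE = "ACGAGCAACACGGTACCTAAACCGGACCCGGAAAGAC"
READS = [
    "ACGAGCAACACGGTAGCTA",
    "TAGCTAAACCGGACCCGGAAAGAC",
]


def best_ungapped_offset(reference, read):
    """Return (offset, mismatches) for the ungapped placement with the fewest mismatches."""
    best = None
    for offset in range(len(reference) - len(read) + 1):
        window = reference[offset:offset + len(read)]
        mismatches = sum(1 for a, b in zip(window, read) if a != b)
        if best is None or mismatches < best[1]:
            best = (offset, mismatches)
    return best  # None if the read is longer than the reference


def call_mismatches(reference, reads, max_mismatches=2):
    """Pile up read bases per position and list non-reference bases (1-based positions)."""
    pileup = defaultdict(Counter)
    for read in reads:
        placement = best_ungapped_offset(reference, read)
        if placement is None or placement[1] > max_mismatches:
            continue  # treat the read as unaligned
        offset = placement[0]
        for i, base in enumerate(read):
            pileup[offset + i][base] += 1
    variants = []
    for pos in sorted(pileup):
        for base, count in sorted(pileup[pos].items()):
            if base != reference[pos]:
                variants.append((pos + 1, reference[pos], base, count))
    return variants


VARIANTS = call_mismatches(REFERENCE, READS)
print(VARIANTS)
```

Both reads overlap the same reference position and carry a G where the reference has a C, which is the kind of evidence a SNP caller accumulates.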
Chr   Position  Ref  Coverage Depth  Genotypes               Gene
chr1  24515167  C    5    11   3     T()     C()     T()
chr1  45396856  G    13   7    9     C()     G()     C()
chr1  68417006  G    43   18   6     A()     G()     A()
chr1  90162621  A    15   99   255   M(AC)   A()     A()
chr1  90162696  G    17   134  255   G()     R(GA)   G()
chr1  90162750  C    19   108  176   Y(CT)   Y(CT)   C()
chr1  90162816  G    30   72   106   G()     K(GT)   K(GT)
chr1  90162975  G    162  48   255   G()     R(GA)   G()
chr1  90163027  C    100  6    255   C()     Y(CT)   Y(CT)
chr1  90163136  A    152  17   176   A()     R(AG)   R(AG)
chr1  90163167  C    132  25   218   C()     M(CA)   M(CA)
chr1  90163191  T    91   19   227   T()     Y(TC)   Y(TC)
chr1  90164490  A    173  16   103   A()     M(AC)   M(AC)
chr1  90164557  A    100  66   137   A()     R(AG)   A()
chr1  90164612  A    62   48   107   A()     R(AG)   R(AG)
chr1  90164677  A    88   37   64    R(AG)   A()     R(AG)
chr1  90165817  T    88   35   56    Y(TC)   Y(TC)   T()
…     …         …    …    …    …     …       …       …
…     …         …    …    …    …     …       …       …
chr17 72952985  C    23   26   31    T()     Y(TC)   T()
chr18 7355152   G    23   34   3     A()     G()     A()
chr18 7355177   A    16   29   3     C()     A()     C()
chr18 25274226  T    28   35   22    C()     Y(CT)   C()
chr18 34475963  A    25   12   25    G(KT)   R(GA)   G()
chr18 38133671  G    69   63   21    C(SG)   G()     G()
chr18 65363507  G    14   29   3     T(KG)   G()     T()
chr18 65363509  T    18   31   3     G(KT)   T()     G()
chr18 71606111  C    9    32   5     A()     C()     A()
chr19 46381078  A    8    12   6     G(RA)   A()     G()
With a limited number of individuals, whole genome/exome sequencing does not always reveal the causative mutations.
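The genotype columns in the table above use single-letter codes such as R(GA) or Y(CT) for heterozygous calls. These follow the standard IUPAC nucleotide ambiguity codes. A minimal sketch of decoding the two-allele codes is shown below; the helper names are hypothetical, and entries in the table that do not follow this pattern are not interpreted here.

```python
# Standard IUPAC codes for single bases and two-base ambiguities.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT",
    "S": "CG", "Y": "CT", "K": "GT",
}


def alleles(code):
    """Return the set of bases represented by a one-letter genotype code."""
    return set(IUPAC[code.upper()])


def is_heterozygous(code):
    """A code that stands for two different bases is a heterozygous call."""
    return len(alleles(code)) == 2


for code in ["C", "R", "Y", "K", "M"]:
    print(code, sorted(alleles(code)), is_heterozygous(code))
```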
Sequence a mapping population
FASTQ files
SAM/BAM files
VCF file
Reference genome based sequence variation detection
Step 1: Alignment
Step 2: Call SNP/INDELs
Step 3: Filter SNP/INDELs
Step 4: Annotate SNP/INDELs
Reference genome based sequence variation detection
Step 1: Alignment
• BWA: Li H. and Durbin R. (2009) Bioinformatics, 25:1754‐60
Step 2: Call SNP/INDELs
• SAMtools: Li H. et al. Bioinformatics, 25, 2078‐9
or
• GATK + Picard: Broad Institute
Reference genome based sequence variation detection
Step 3: Filtering
• GATK
• Write your own code
Step 4: Annotation
• Annovar: http://www.openbioinformatics.org/annovar/
Standard file formats
• FASTQ
• SAM/BAM
• VCF
@20F75AAXX:5:1:335:1565
ACCTTGTTGAGAAACAGGAGGTGTTGTTCTTCAAAG
+20F75AAXX:5:1:335:1565
]]]]][]][][[][]Z[[[][[[[][[[[][[[[[R
@20F75AAXX:5:1:466:1056
GGAAGCAACAGCTAATACATGAATGGATATCGATCG
+20F75AAXX:5:1:466:1056
[]]]]][]]]Y]]]][Y[[[[[[[[[[Y[Y[YW[[[
@20F75AAXX:5:1:256:1724
GCCCAACAAAGACCGGTCACCAAAGACAGATGATTC
+20F75AAXX:5:1:256:1724
]][]][]][[[[]L[[[[][[[Z[[[[[S[[ZW[[[
FASTQ file:
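Each FASTQ record above spans four lines: a header starting with "@", the sequence, a "+" separator line, and one quality character per base. The sketch below reads such records and converts quality characters to numeric scores. The offset of 64 is an assumption based on the characters seen in this example (older Illumina encoding); many current files use an offset of 33 instead. Function names are hypothetical.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality_string) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file (or a blank line)
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@"):
            raise ValueError("FASTQ header must start with '@': " + header)
        yield header[1:], sequence, quality


def decode_quality(quality, offset=64):
    """Convert quality characters to integer scores (offset is an assumption)."""
    return [ord(ch) - offset for ch in quality]


EXAMPLE = (
    "@20F75AAXX:5:1:335:1565\n"
    "ACCTTGTTGAGAAACAGGAGGTGTTGTTCTTCAAAG\n"
    "+20F75AAXX:5:1:335:1565\n"
    "]]]]][]][][[][]Z[[[][[[[][[[[][[[[[R\n"
)

RECORDS = list(read_fastq(io.StringIO(EXAMPLE)))
NAME, SEQ, QUAL = RECORDS[0]
SCORES = decode_quality(QUAL)
print(NAME, len(SEQ), SCORES[:5])
```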
HWI‐EAS83_20F7TAAXX:1:1:379:338 16 4 157555988 25 36M * 0 0
AGAAAACTGCAAAGCACGAGTCTAGCAGATACCCTT
h?DhhhLDPOhhhhhhhhhhhhhhhhhhhhhhhhhh XT:A:U NM:i:2 X0:i:1 X1:i:0 XM:i:2 XO:i:0 XG:i:0
MD:Z:2C32G0
HWI‐EAS83_20F7TAAXX:1:1:98:170 16 4 28122708 37 36M * 0 0
GCACCCTTTAACTCGGGCTAACTATCTTGCTTCACC
VbINbYZh_hUhQhd\^hfhhhhhhhhhhhhhhhhh XT:A:U NM:i:1 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:33G2
HWI‐EAS83_20F7TAAXX:1:1:582:80 4 * 0 0 * * 0 0
ATGGCTGCCTCGCAGAATCGAAAGTTAGTGCCGCAC
hfhhhhahh`hhAVhEhahQKHKQA_IIPPF@DhEV
HWI‐EAS83_20F7TAAXX:1:1:169:517 16 3 170277940 25 36M * 0 0
AAAACCATATCTGCTGGAAACTCTGCTTCCACAAGC
CDhKDBhDhFaGghMhahhhhPhhhhhhhhhhhhhh XT:A:U NM:i:2 X0:i:1 X1:i:0 XM:i:2 XO:i:0 XG:i:0
MD:Z:0T0C34
SAM file:
• Sequence (forward strand of the reference genome)
• Quality score
• Alignment information (position, strand, mismatches, gap)
• Ambiguous alignments
• Paired‐end information
• Read group
Information encoded in SAM file
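As a rough illustration of how this information is laid out, the sketch below splits one alignment line into the eleven mandatory SAM columns plus optional tags and decodes two bits of the FLAG field (0x4 for unmapped, 0x10 for reverse strand). The record is the first alignment from the SAM example above, rebuilt with tab separators; the function and dictionary key names are hypothetical.

```python
SAM_COLUMNS = ["qname", "flag", "rname", "pos", "mapq", "cigar",
               "rnext", "pnext", "tlen", "seq", "qual"]


def parse_sam_line(line):
    """Parse one tab-delimited SAM alignment line into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_COLUMNS, fields[:11]))
    for key in ("flag", "pos", "mapq", "pnext", "tlen"):
        record[key] = int(record[key])
    tags = {}
    for item in fields[11:]:
        tag, type_code, value = item.split(":", 2)
        tags[tag] = int(value) if type_code == "i" else value
    record["tags"] = tags
    record["is_unmapped"] = bool(record["flag"] & 0x4)
    record["is_reverse"] = bool(record["flag"] & 0x10)
    return record


EXAMPLE_LINE = "\t".join([
    "HWI-EAS83_20F7TAAXX:1:1:379:338", "16", "4", "157555988", "25", "36M",
    "*", "0", "0",
    "AGAAAACTGCAAAGCACGAGTCTAGCAGATACCCTT",
    "h?DhhhLDPOhhhhhhhhhhhhhhhhhhhhhhhhhh",
    "XT:A:U", "NM:i:2", "X0:i:1", "X1:i:0", "XM:i:2", "XO:i:0", "XG:i:0",
    "MD:Z:2C32G0",
])

RECORD = parse_sam_line(EXAMPLE_LINE)
print(RECORD["rname"], RECORD["pos"], RECORD["is_reverse"], RECORD["tags"]["NM"])
```

Here FLAG 16 marks a read aligned to the reverse strand, and the NM tag gives the edit distance to the reference.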
BAM is a compressed SAM file
• BAM file is several times smaller than SAM;
• BAM file can be indexed and queried;
• Most software operates directly on BAM;
• BAM format can potentially replace FASTQ format.
##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot‐NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20 1234567 microsat1 GTCT G,GTACT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
VCF file ‐ variant call format
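To show how a VCF data line is organized, the sketch below parses the first record of the example above: eight fixed columns, a FORMAT column that names the per-sample fields, and one column per sample. Real VCF files are tab-delimited, so the line is rebuilt with tabs here. The function names are hypothetical and the parser ignores the header lines.

```python
def parse_info(info):
    """Split the INFO column into key/value pairs; flags without '=' become True."""
    if info == ".":
        return {}
    fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def parse_vcf_line(line, sample_names):
    """Parse one tab-delimited VCF data line into a dictionary."""
    columns = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info, fmt = columns[:9]
    keys = fmt.split(":")
    samples = {
        name: dict(zip(keys, column.split(":")))
        for name, column in zip(sample_names, columns[9:])
    }
    return {
        "chrom": chrom, "pos": int(pos), "id": vid, "ref": ref,
        "alt": alt.split(","), "qual": qual, "filter": flt,
        "info": parse_info(info), "samples": samples,
    }


SAMPLES = ["NA00001", "NA00002", "NA00003"]
LINE = "\t".join([
    "20", "14370", "rs6054257", "G", "A", "29", "PASS",
    "NS=3;DP=14;AF=0.5;DB;H2", "GT:GQ:DP:HQ",
    "0|0:48:1:51,51", "1|0:48:8:51,51", "1/1:43:5:.,.",
])

VARIANT = parse_vcf_line(LINE, SAMPLES)
print(VARIANT["pos"], VARIANT["info"]["DP"], VARIANT["samples"]["NA00003"]["GT"])
```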
Alignment with BWA
Commonly used parameters:
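Alignment step (aln):
‐n: maximum edit distance if given as an integer, or the fraction of missing alignments given a 2% uniform base error rate if given as a float (default 0.04)
‐o: maximum number of gap opens (default 1)
Write SAM file step (samse or sampe):
‐n: maximum number of alignments to report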
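The two BWA steps above can be sketched as a short list of commands for a single-end run. The snippet only builds and prints the command lines; it does not run BWA. File names, the function names, and the value passed to `samse -n` are hypothetical, and the exact options available depend on the BWA version installed.

```python
import shlex


def bwa_single_end_commands(reference, fastq, prefix,
                            max_diff=0.04, max_gap_opens=1, max_hits=3):
    """Return (argv, stdout_file) pairs for index, aln and samse."""
    sai = prefix + ".sai"
    sam = prefix + ".sam"
    return [
        (["bwa", "index", reference], None),
        (["bwa", "aln", "-n", str(max_diff), "-o", str(max_gap_opens),
          reference, fastq], sai),
        (["bwa", "samse", "-n", str(max_hits), reference, sai, fastq], sam),
    ]


def render(commands):
    """Format the commands as shell lines, redirecting stdout where needed."""
    lines = []
    for argv, stdout_file in commands:
        line = " ".join(shlex.quote(arg) for arg in argv)
        if stdout_file is not None:
            line += " > " + shlex.quote(stdout_file)
        lines.append(line)
    return lines


PLAN = bwa_single_end_commands("ref.fa", "reads.fastq", "sample1")
for shell_line in render(PLAN):
    print(shell_line)
```

For paired-end data, `aln` would be run once per FASTQ file and `sampe` used in place of `samse`.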
‐ Converting SAM to BAM
‐ Index BAM
*** If you want to use Broad GATK software to call SNPs, do not use SAMtools; always use Picard to process SAM and BAM files.
Samtools: view; index
Picard: SamFormatConverter; BuildBamIndex
BAM file can be visualized with IGV software
Clean up the BAM file
• Mark possible PCR duplicates
  ** For sequence reads with the exact same sequence, only one copy is kept.
• Base quality score recalibration
  • Phred quality score: Q20 corresponds to a 1% error rate.
  • Illumina quality scores range from 0 to 62 and need to be recalibrated to reflect the actual error rate.
• Local realignment around indels
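The relationship between a Phred quality score and an error rate follows the standard definition Q = -10 * log10(P), where P is the probability that the base call is wrong. A minimal sketch of the conversion in both directions (function names are hypothetical):

```python
import math


def phred_to_error_probability(quality):
    """Error probability implied by a Phred quality score."""
    return 10 ** (-quality / 10)


def error_probability_to_phred(probability):
    """Phred quality score implied by an error probability."""
    return -10 * math.log10(probability)


for q in (10, 20, 30, 40):
    print(q, phred_to_error_probability(q))
```

Recalibration adjusts the reported scores so that, for example, bases labeled Q20 are in fact wrong about 1% of the time.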
Multi‐sample SNP and INDEL calling
• Use UnifiedGenotyper (GATK) or mpileup (SAMtools) to call SNPs and INDELs from multiple samples.
• Set the variant calling thresholds:
  Emission threshold: Q10 (>10x), Q3 (<10x)
  Confidence threshold: Q30 (>10x), Q4 (<10x)
Filtering
• Read depth (DP)
• Allele frequency (AF)
• Number of samples with data (NS)
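A filter on these three INFO fields could look like the sketch below. The cutoffs (`min_depth`, `min_samples`, `min_af`) are hypothetical values chosen for illustration, not recommendations, and sites without an AF value are simply rejected here. The INFO strings are taken from the VCF example earlier in the slides.

```python
def parse_info(info):
    """Split a VCF INFO string into key/value pairs; flags become True."""
    if info == ".":
        return {}
    fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def passes_filters(info, min_depth=10, min_samples=3, min_af=0.05):
    """Keep a site only if DP, NS and AF all meet the (assumed) cutoffs."""
    fields = parse_info(info)
    if int(fields.get("DP", 0)) < min_depth:
        return False
    if int(fields.get("NS", 0)) < min_samples:
        return False
    if "AF" not in fields:
        return False
    # AF holds one value per alternate allele; use the most frequent one.
    frequencies = [float(value) for value in fields["AF"].split(",")]
    return max(frequencies) >= min_af


ROWS = [
    "NS=3;DP=14;AF=0.5;DB;H2",
    "NS=3;DP=11;AF=0.017",
    "NS=2;DP=10;AF=0.333,0.667;AA=T;DB",
    "NS=3;DP=13;AA=T",
]
KEPT = [row for row in ROWS if passes_filters(row)]
print(KEPT)
```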
• SAM ‐> BAM
• Flag possible PCR duplicates
• Quality score calibration
• INDEL realignment
• Call variants on multiple samples
• Filtering
SAMtools GATK/Picard
* SAMtools mpileup has a built‐in realignment tool.
** Limited filtering function. Poor documentation.
GATK Documentation: http://www.broadinstitute.org/gsa/wiki/index.php/Best_Practice_Variant_Detection_with_the_GATK_v2
SAMtools Variants Calling Documentation: http://samtools.sourceforge.net/mpileup.shtml
1. Experimental Design.
2. Computational Resource at Cornell.
Practical aspects
Whole genome sequencing vs
Targeted sequencing
Target enrichment by array‐based or in‐solution capture technology (e.g. exome sequencing).
[Figure: ApeK I sites in Line 1, Line 2 and Line 3]
Whole genome sequencing vs
Genotyping by Sequencing (GBS)
Ed Buckler Lab (http://www.maizegenetics.net/gbs‐overview)
Advantages of GBS over whole genome sequencing
1. Reduced cost by multiplexing;
2. Possible to map markers that are not on the reference genome;
To identify causative mutations in a mutant strain, it is necessary to use both sequencing
and genetic linkage analysis.
[Figure: cross (X) followed by F1 and F2 generations]
Mapping and Mutation Identification of the Pooled F2 population
SHOREmap: Schneeberger K et al. (2009) Nat Methods. 6(8):550‐1.
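The idea behind mapping with a pooled F2 population is that, in a pool of mutant F2 individuals, the region linked to the causative mutation is enriched for alleles from the mutant parent, so the mutant-parent allele frequency approaches 1 there. The sketch below bins marker sites into windows and picks the window with the highest frequency. It is a simplified illustration and not SHOREmap's actual algorithm; the site data, window size, and function names are hypothetical.

```python
def window_allele_frequencies(sites, window_size):
    """Sum mutant-parent reads and total reads per window; return (start, frequency)."""
    totals = {}
    for position, mutant_reads, depth in sites:
        index = position // window_size
        mutant_sum, depth_sum = totals.get(index, (0, 0))
        totals[index] = (mutant_sum + mutant_reads, depth_sum + depth)
    return [
        (index * window_size, mutant_sum / depth_sum)
        for index, (mutant_sum, depth_sum) in sorted(totals.items())
        if depth_sum > 0
    ]


def candidate_window(sites, window_size):
    """Window with the highest mutant-parent allele frequency."""
    return max(window_allele_frequencies(sites, window_size),
               key=lambda item: item[1])


# Hypothetical marker sites: (position, reads with mutant-parent allele, total reads)
SITES = [
    (1200, 11, 20), (4800, 9, 20),
    (11500, 16, 20), (14200, 18, 20),
    (21000, 20, 20), (26500, 19, 20),
    (33000, 14, 20),
]

WINDOWS = window_allele_frequencies(SITES, 10000)
BEST = candidate_window(SITES, 10000)
print(WINDOWS)
print(BEST)
```

Candidate mutations would then be sought among the variants inside the highest-frequency interval.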
Using SHOREmap for mapping and mutation identification
Zuryn et al. (2010) A Strategy for Direct Mapping and Identification of Mutations by Whole‐Genome Sequencing. Genetics 186: 427–430
Alternative approach: test for enrichment of new mutations
Computational Resource at Cornell
CBSU / 3CPG BioHPC Laboratory (625 Rhodes Hall)
Office Hour: 1:00 to 3:00 PM every Monday.
Email [email protected] to get a BioHPC lab account.
Training workshops
• Linux for Biologists
• Programming workshop (PERL)