Next-Generation Sequencing (NGS) Technologies and Data Analysis · 2010-05-04 · Next-Generation...
Next-Generation Sequencing (NGS) Technologies and Data Analysis
Christopher E. Mason
TA: Paul Zumbo
Spring 2010
Class #2: Alignments, QC, and data processing
A BWT Human Genome
copy the version of BWA into the 1KG directory
> cp bwa ../1KG/
Get Some Data (1KG)
Get one paired-end lane
> ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/data/NA06985/sequence_read/ERR001014_1.filt.fastq.gz
> ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/data/NA06985/sequence_read/ERR001014_2.filt.fastq.gz
Gunzip the files:
> gunzip *.gz
Look at the data:
> ls -lh
Count the lines (wc prints newline, word, and byte counts):
> wc ERR001014_2.filt.fastq
Do some math:
> expr 23494300 / 4
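The division by 4 works because a FASTQ file stores each read as four lines. A minimal Python sketch of the same arithmetic, assuming an uncompressed, well-formed FASTQ file; the function name `count_fastq_reads` is hypothetical and used only for illustration:

```python
def count_fastq_reads(path):
    """Count reads in an uncompressed FASTQ file (4 lines per read)."""
    with open(path) as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        # A well-formed FASTQ file has exactly four lines per record.
        raise ValueError("line count is not a multiple of 4; file may be truncated")
    return n_lines // 4


# Same arithmetic as the expr command on the slide.
print(23494300 // 4)  # 5873575
```

For a real lane, `count_fastq_reads("ERR001014_2.filt.fastq")` would do the `wc` and `expr` steps in one call.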
http://www.1000genomes.org
Performing Alignments
Perform the alignment (aln):
> ./bwa aln ../genomes/hg19.fa ERR001014_1.filt.fastq > ERR001014_1.sai
> ./bwa aln ../genomes/hg19.fa ERR001014_2.filt.fastq > ERR001014_2.sai
Other options listed at: http://bio-bwa.sourceforge.net/bwa.shtml
Convert Suffix Arrays to Positions
Generate Alignments in SAM format (Single End Reads)
> ./bwa-0.5.7/bwa samse hg19.fa ERR001014_1.sai ERR001014_1.filt.fastq > ERR001014_1.sam
Generate Alignments in SAM format (Paired End Reads)
> ./bwa-0.5.7/bwa sampe hg19.fa ERR001014_1.sai ERR001014_2.sai ERR001014_1.filt.fastq ERR001014_2.filt.fastq > ERR001014_PE.sam
Also, your I/O system can make a difference
Get a lustre filesystem if you can! (short for Linux Cluster)
(NFS) Network File System, (PVFS) Parallel Virtual File System, (TerraFS) TerraScale Tech. File System, (GPFS) IBM General Parallel File System
Sun Microsystems
Threads, Errors, and Indels strongly affect the alignments' speed and accuracy
> time ./bwa aln ../genomes/hg19.fa ERR001014_1.filt.fastq
$ $!"#$%&'(')*
> time ./bwa aln -t 8 ../genomes/hg19.fa ERR001014_1.filt.fastq
$ +#"&'"")*
> time ./bwa aln -t 8 -e 10 ../genomes/hg19.fa ERR001014_1.filt.fastq
  4m10.784s
Examples with a file with known variants:
http://physiology.med.cornell.edu/faculty/mason/lab/data/NGS/
> ./bwa aln ../genomes/hg19.fa Indels.fastq > Indels.sai
> ./bwa samse ../genomes/hg19.fa Indels.sai Indels.fastq > Indels.sam
Now find sequences with larger indels
> ./bwa aln -e 10 ../genomes/hg19.fa Indels.fastq > Indels_e10.sai
> ./bwa samse ../genomes/hg19.fa Indels_e10.sai Indels.fastq > Indels_e10.sam
Kurtosis and Skewness Estimate PE QC
Skewness: As close to zero as possible
Kurtosis: As high as possible (at least >0.6)
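A rough sketch of how these two statistics can be computed. The slide does not say which distribution is measured or which kurtosis convention is used, so this sketch assumes the paired-end insert-size distribution and excess kurtosis (normal distribution = 0), both computed from population moments. The function name and the insert sizes are hypothetical:

```python
def skewness_and_kurtosis(values):
    """Return (skewness, excess kurtosis) from population moments."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance
    if m2 == 0:
        raise ValueError("all values are identical")
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return skew, excess_kurtosis


# Hypothetical insert sizes (bp) from a paired-end library.
insert_sizes = [180, 195, 200, 200, 205, 210, 220]
skew, kurt = skewness_and_kurtosis(insert_sizes)
print(round(skew, 3), round(kurt, 3))
```

If the course used raw (non-excess) kurtosis instead, drop the `- 3.0` term before comparing against the 0.6 guideline.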
What is SAM?
SAM is a rapidly developing data specification and format
for the storage of sequence alignments and their mapping coordinates.
Sequence Alignment/Map (SAM) also has a binary version of the format, called BAM.
SAMtools is a set of tools for manipulating and controlling SAM/BAM files
Bam-Bam of the Flintstones is currently unrelated to Heng Li and Richard Durbin’s work with SAM/BAM
SAM Output
@HD = Header
@SQ = Sequence Dictionary
LN = length of sequence
@RG = Read Group
ID = unique read group identifier
PU = Platform Unit
LB = Library
SM = Sample
SAM Output
QNAME = name of read
FLAG = Bitwise FLAG (2^16-1)
RNAME = Reference sequence name
POS = Position (1-based)
MAPQ = Mapping Quality (Phred-based)
CIGAR = CIGAR STRING
MRNM = Mate Reference Sequence
MPOS = 1-based Mate Position of the other seq
ISIZE = Inferred Insert Size
SEQ = Sequence reported on the + strand
QUAL = Quality scores (ASCII-33 = Phred)
TAG = TAG
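As an illustration of the field layout above, here is a minimal Python sketch that splits one tab-separated SAM alignment line into the eleven named fields and converts QUAL to Phred scores with the ASCII-33 rule. The function name `parse_sam_line` and the example read are made up for this sketch; real pipelines would normally use SAMtools or a dedicated library instead:

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "MRNM", "MPOS", "ISIZE", "SEQ", "QUAL"]
INT_FIELDS = ("FLAG", "POS", "MAPQ", "MPOS", "ISIZE")


def parse_sam_line(line):
    """Parse one SAM alignment line (not a header line) into a dict."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) < len(SAM_FIELDS):
        raise ValueError("expected at least 11 tab-separated fields")
    record = dict(zip(SAM_FIELDS, parts[:11]))
    for name in INT_FIELDS:
        record[name] = int(record[name])
    # Anything after the 11 mandatory fields is an optional TAG.
    record["TAGS"] = parts[11:]
    # QUAL stores Phred + 33 as ASCII; "*" means qualities are absent.
    if record["QUAL"] == "*":
        record["PHRED"] = []
    else:
        record["PHRED"] = [ord(ch) - 33 for ch in record["QUAL"]]
    return record


# Made-up alignment line for illustration only.
example = "read1\t99\tchr1\t100\t60\t4M\t=\t300\t204\tACGT\tIIII\tNM:i:0"
rec = parse_sam_line(example)
print(rec["FLAG"], rec["POS"], rec["PHRED"])  # 99 100 [40, 40, 40, 40]
```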
Bitwise Flags are Combined Bits
0100101001010
Bit 0 = The read was part of a pair during sequencing
Bit 1 = The read is mapped in a pair
Bit 2 = The query sequence is unmapped
Bit 3 = The mate is unmapped
Bit 4 = Strand of query (0=forward 1=reverse)
The flag value is the sum of the individual bits. If a bit is false, add nothing to the total. If it is true, add 2 raised to the power of the bit position.
For example:
Bit 0 - false - add nothing
Bit 1 - true - add 2**1 = 2
Bit 2 - false - add nothing
Bit 3 - true - add 2**3 = 8
Bit 4 - true - add 2**4 = 16
Bit pattern = 11010 = 16+8+2 = 26
So the flag value would be 26.
Other Examples:
0 = 0000000
99 = 01100011
147 = 10010011
0 = Not paired, mapped, forward strand.
99 = Paired, Proper Pair, Mapped, Mate Mapped, Forward, Mate Reverse, First in pair, Not second in pair
147 = Paired, Proper Pair, Mapped, Mate Mapped, Reverse, Mate Forward, Not first in pair, Second in pair
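The additive rule can be checked with a few lines of Python. This is only a sketch of the arithmetic described above; the helper names `flag_from_bits` and `bits_from_flag` are hypothetical:

```python
def flag_from_bits(set_bits):
    """Sum 2**bit for every bit position that is true."""
    return sum(2 ** bit for bit in set_bits)


def bits_from_flag(flag):
    """Return the sorted bit positions that are set in a flag value."""
    return [bit for bit in range(flag.bit_length()) if flag & (1 << bit)]


print(flag_from_bits([1, 3, 4]))  # 26
print(format(99, "08b"))          # 01100011
print(format(147, "08b"))         # 10010011
print(bits_from_flag(99))         # [0, 1, 5, 6]
print(bits_from_flag(147))        # [0, 1, 4, 7]
```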
Bitwise Flag Explanation
Index  2^Index  MethodName                     Flag
0      1        isReadPartOfAPairedAlignment   0x0001
1      2        isReadAProperPairedAlignment   0x0002
2      4        isQueryUnmapped                0x0004
3      8        isMateUnMapped                 0x0008
4      16       isQueryReverseStrand           0x0010 (false +, true -)
5      32       isMateReverseStrand            0x0020
6      64       isReadFirstPair                0x0040
7      128      isReadSecondPair               0x0080
8      256      isAlignmentNotPrimary          0x0100
9      512      doesReadFailVendorQC           0x0200
10     1024     isReadADuplicate               0x0400
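Going the other way, a flag value can be expanded back into the named properties. A minimal sketch, assuming the eleven method names and bit order of the flag explanation table; `decode_flag` is a hypothetical helper, not part of SAMtools:

```python
# Index in this list = bit position; mask = 1 << index.
FLAG_NAMES = [
    "isReadPartOfAPairedAlignment",  # 0x0001
    "isReadAProperPairedAlignment",  # 0x0002
    "isQueryUnmapped",               # 0x0004
    "isMateUnMapped",                # 0x0008
    "isQueryReverseStrand",          # 0x0010
    "isMateReverseStrand",           # 0x0020
    "isReadFirstPair",               # 0x0040
    "isReadSecondPair",              # 0x0080
    "isAlignmentNotPrimary",         # 0x0100
    "doesReadFailVendorQC",          # 0x0200
    "isReadADuplicate",              # 0x0400
]


def decode_flag(flag):
    """Return the names of all properties whose bit is set in flag."""
    return [name for bit, name in enumerate(FLAG_NAMES) if flag & (1 << bit)]


for value in (0, 99, 147):
    print(value, decode_flag(value))
```

A flag of 0 yields an empty list, which corresponds to an unpaired read mapped to the forward strand.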
SAM Specifications
SAM can store various alignments as a CIGAR format:
1. Standard
2. Clipped
a. Soft-clipped = non-matched sequence present in alignment
b. Hard-clipped = non-matched sequence missing from alignment
4. Spliced (Intron (N))
5. Multi-part
6. Padded (Insertions (I) and Deletions (D))
7. Color-space
SAM Specifications
Li et al, 2010
Some ambiguities remain
CIGAR format is a short way of storing mis-aligned bases
to a reference genome.
In certain cases, CIGAR will need pileup-based padding,
though this is currently not supported.
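To make the CIGAR notation concrete, here is a small Python sketch that splits a CIGAR string into (length, operation) pairs and computes how many reference bases the alignment spans. It assumes the operation letters M, I, D, N, S, H and P, and that only M, D and N consume reference positions; the function names and the example strings are illustrative, not taken from the slides:

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP])")
REF_CONSUMING = set("MDN")  # match/mismatch, deletion, skipped region (intron)


def parse_cigar(cigar):
    """Split a CIGAR string such as '3S6M1I4M' into (length, op) pairs."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Rebuild the string to make sure nothing was silently skipped.
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"unparseable CIGAR: {cigar!r}")
    return ops


def reference_span(cigar):
    """Number of reference bases covered by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in REF_CONSUMING)


print(parse_cigar("3S6M1I4M"))     # [(3, 'S'), (6, 'M'), (1, 'I'), (4, 'M')]
print(reference_span("3S6M1I4M"))  # 10
print(reference_span("5M100N5M"))  # 110
```

Soft-clipped (S) and inserted (I) bases are present in the read but do not advance the reference coordinate, which is why the first example spans 10 reference bases rather than 14.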
What if I don't have fastq files, or I already have alignments?
Some tools already exist to change formats:
BWA
Converters: qualfa2fq.pl, solid2fastq.pl (not recommended!)
SAMtools
Tools: samtools.pl, wgsim_eval.pl
Converters: blast2sam.pl, bowtie2sam.pl, export2sam.pl, novo2sam.pl, sam2vcf.pl, soap2sam.pl, zoom2sam.pl
Predicting Genetic Variation with SAMtools
First, you will need to get a toolkit, and we will use SAMtools:
http://sourceforge.net/projects/samtools/files/
http://samtools.sourceforge.net/
Download the source code:
> http://sourceforge.net/projects/samtools/files/samtools/0.1.7/samtools-0.1.7a.tar.bz2/download
Unzip the tarball (or double-click if on Desktop):
> bzip2 -cd samtools-0.1.7a.tar.bz2 | tar xvf -
Change into the new directory
> cd samtools-0.1.7a
Compile the program
> make
Move the executable into main directory
> cp samtools ../
SAMtools main options
Pileup the Reads to call variants
Import the reads into BAM (binary SAM) format
> ./samtools import ../genomes/hg19.fa Indels_e10.sam Indels_e10.bam
Sort the BAM file (for faster processing later)
> ./samtools sort Indels_e10.bam Indels_e10.sorted
Perform a Pileup (layer the reads on top of each other)
> ./samtools pileup -vcf ../genomes/hg19.fa Indels_e10.sorted.bam > Indels_e10.pileup_raw
If you want to clean up the pileup by depth of coverage:
> perl samtools-0.1.7/samtools.pl varFilter -d 10 file.pileup.raw > Indels_e10.pileup_10x
If you want to clean up the pileup by quality scores of q20 (column 6):
Thanks to Alfred Aho, Peter Weinberger, and Brian Kernighan (AWK)
> awk '$6>=20' file.pileup.raw > file.pileup_RMS20.out
To look closer at your list of variants:
> less ERR.pileup_10x
What is in the output?
1. Chromosome: reference sequence name
2. Position: reference coordinate in position (1-based)
3. Reference Base: base of the genome, or `*' for an indel line
4. Genotype: where heterozygotes are encoded in the IUPAC/IUB code: M=A/C, R=A/G, W=A/T, S=C/G, Y=C/T and K=G/T; indels are indicated by, for example, */+A, -A/* or +CC/-C. There is no difference between */+A or +A/*.
5. Consensus Quality: Phred-scaled likelihood that the genotype is wrong
6. SNP Quality: Phred-scaled likelihood that the genotype is identical to the reference, which is also called `SNP quality'. Suppose the reference base is A and in alignment we see 17 G and 3 A. We will get a low consensus quality because it is difficult to distinguish an A/G heterozygote from a G/G homozygote. We will get a high SNP quality, though, because the evidence of a SNP is very strong.
7. RMS: root mean square (RMS) mapping quality, a measure of the variance of quality scores
8. Coverage: # reads covering the position
9. Bases with Support/Indel#1: Bases used for SNP line, “^” from CIGAR N/S/H break, “$” end of read segment; the 1st indel allele otherwise
10. Quality of bases/Indel#2: base quality at a SNP line; the 2nd indel allele otherwise
11. INDEL#1: # reads directly supporting the 1st indel allele
12. INDEL#2: # reads directly supporting the 2nd indel allele
13. INDEL#3: # reads supporting a third indel allele
14. Blank
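The column list above can be turned into a small parser. This Python sketch names the first ten columns of a consensus pileup line, keeps lines whose SNP quality (column 6) is at least 20, and converts a Phred score back to a probability. The column names, function names and example lines are invented for illustration; it splits on tabs, whereas awk splits on any whitespace:

```python
PILEUP_COLUMNS = [
    "chrom", "pos", "ref", "genotype", "consensus_qual",
    "snp_qual", "rms_mapq", "coverage", "bases", "quals",
]
INT_COLUMNS = ("pos", "consensus_qual", "snp_qual", "rms_mapq", "coverage")


def parse_pileup_line(line):
    """Split one consensus pileup line into its first ten named columns."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) < len(PILEUP_COLUMNS):
        raise ValueError("expected at least 10 tab-separated columns")
    record = dict(zip(PILEUP_COLUMNS, parts))
    for name in INT_COLUMNS:
        record[name] = int(record[name])
    return record


def keep_by_snp_quality(lines, min_qual=20):
    """Keep lines whose column 6 (SNP quality) is >= min_qual."""
    return [ln for ln in lines if parse_pileup_line(ln)["snp_qual"] >= min_qual]


def phred_to_probability(q):
    """Phred-scaled score Q corresponds to probability 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


# Made-up pileup lines for illustration only.
lines = [
    "chr1\t1000\tA\tR\t30\t45\t60\t20\tGGg.,\tIIIII",
    "chr1\t2000\tC\tY\t5\t12\t37\t4\tT.,t\tII5I",
]
kept = keep_by_snp_quality(lines)
print(len(kept), parse_pileup_line(kept[0])["pos"])  # 1 1000
print(phred_to_probability(20))
```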
SAMTools pileup uses the SNP model from MAQ, and has a few options
Options:
-c Create the consensus base at each position
-v Show positions that do not agree with the reference genome (hg19.fa)
-f The reference genome is in FASTA format
What are some other filters?
Parameter INTEGER [Default Value]
-Q INT minimum RMS mapping quality for SNPs [25]
-q INT minimum RMS mapping quality for gaps [10]
-d INT minimum read depth [3]
-D INT maximum read depth [100]
-G INT min indel score for nearby SNP filtering [25]
-w INT SNP within X bp around a gap to filter [10]
-W INT window size for filtering dense SNPs [10]
-N INT max number of SNPs in a window [2]
-l INT window size for filtering adjacent gaps [30]
International Union of Biochemistry (IUB) / or Intn’l Union of Pure and Applied Chemistry (IUPAC) Codes
Code  Definition  Meaning
A     A           Adenine
C     C           Cytosine
G     G           Guanine
T     T           Thymine
R     AG          puRine
Y     CT          pYrimidine
K     GT          Keto
M     AC          amino
S     GC          Strong
W     AT          Weak
B     CGT         Not A
D     AGT         Not C
H     ACT         Not G
V     ACG         Not T
N     AGCT        any
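The code table maps directly onto a lookup dictionary. A minimal Python sketch, with hypothetical helper names, that expands an ambiguity code into its bases and tests whether a pileup genotype code denotes a two-allele heterozygote:

```python
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "K": "GT", "M": "AC", "S": "GC", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "AGCT",
}


def expand(code):
    """Return the sorted list of bases an IUPAC/IUB code stands for."""
    return sorted(IUPAC[code.upper()])


def is_heterozygous_snp(code):
    """True when the code stands for exactly two bases (e.g. R = A/G)."""
    return len(IUPAC[code.upper()]) == 2


def matches(code, base):
    """True when a concrete base is allowed by the ambiguity code."""
    return base.upper() in IUPAC[code.upper()]


print(expand("R"))               # ['A', 'G']
print(is_heterozygous_snp("Y"))  # True
print(matches("B", "A"))         # False, because B means "Not A"
```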
Tview
First, make your BAM Index
> ./samtools index Indels_e10.sorted.bam
Now you can look at your alignments
> ./samtools tview Indels_e10.sorted.bam ../genomes/hg19.fa
> ./samtools tview ERR001014_PE.sorted.bam ../genomes/hg19.fa
> Go to a specific interval: g
Then type: chr7:57546416
Open Factors of a Variant's Fidelity
How do we know the quality is good?
(N) Number of reads supporting that site,
(Pv) Probability of that platform-specific variant change,
(QVD) The average deviation of the quality values,
(T) The set of alignments with unique start sites,
(D) PCR Duplicates,
(S) Strand representation (half on one, half on the other),
(Z) Zygosity change (CNV regions)
(C) Cellular heterogeneity
Genomic Relativity can create interesting compound heterozygotes
chromo  location  patient1D  reads  patient1R  reads  patient2D  reads  patient2R  reads
chr11   61652197  NA         0      -G/+GTCG   12     NA         0      -G/-G      8
Paternal (-G):     ACATTGGG
Reference:         ACATGTGGG
Maternal (+GTCG):  ACATGGTCGTGGG
-GGTCG/+GGTCG
8-10x coverage sufficient for high-quality SNP calls
Variant Call Format
http://1000genomes.org/wiki/doku.php?id=1000_genomes:analysis:vcfv3.2
Columns:
1. #CHROM
2. POS
3. ID
4. REF
5. ALT
6. QUAL
7. FILTER
8. INFO
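For orientation, a minimal Python sketch that reads one tab-separated VCF data line into the eight columns listed above. It assumes comma-separated ALT alleles and semicolon-separated `KEY=VALUE` or bare-flag INFO entries; the function name and the example line are made up, and details of the VCF 3.2 draft linked above may differ:

```python
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf_line(line):
    """Parse one VCF data line; return None for header/comment lines."""
    if line.startswith("#"):
        return None
    parts = line.rstrip("\n").split("\t")
    if len(parts) < len(VCF_COLUMNS):
        raise ValueError("expected at least 8 tab-separated columns")
    record = dict(zip(VCF_COLUMNS, parts[:8]))
    record["POS"] = int(record["POS"])
    record["ALT"] = record["ALT"].split(",")
    info = {}
    if record["INFO"] != ".":
        for item in record["INFO"].split(";"):
            if not item:
                continue
            key, _, value = item.partition("=")
            # Entries without "=" are treated as boolean flags.
            info[key] = value if value else True
    record["INFO"] = info
    return record


# Made-up data line for illustration only.
example = "20\t14370\trs0000001\tG\tA,T\t29\t0\tNS=3;DP=14;DB"
rec = parse_vcf_line(example)
print(rec["CHROM"], rec["POS"], rec["ALT"], rec["INFO"]["DP"])  # 20 14370 ['A', 'T'] 14
```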
1Kg
Other Analysis Options
The Genome Analysis Toolkit (GATK)
http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit
Picard
http://picard.sourceforge.net/index.shtml
BioPerl:
Bio::DB::Sam
Analyze Features of Errors for each base with GATK
• Reported quality score
• The position within the read
• The preceding and current nucleotide (sequencing chemistry effect) observed by the sequencing machine
• Probability of mismatching the reference genome
• Re-calculate Qscores
GATK Examples (extra)
First, let's make sure you have java
> java -version
Now, let's get GATK
> wget ftp://ftp.broadinstitute.org/pub/gsa/GenomeAnalysisTK/GenomeAnalysisTK-latest.tar.bz2
Bunzip2, and untar the file. Cd into that directory.
Let's get some example data to work with:
> wget ftp://ftp.broadinstitute.org/pub/gsa/exampleFiles/exampleFiles.tar.bz2
How many reads do you have?
> java -jar GenomeAnalysisTK.jar -R exampleFASTA.fasta -I exampleBAM.bam -T CountReads
How many loci do you have?
> java -jar GenomeAnalysisTK.jar -R exampleFASTA.fasta -I exampleBAM.bam -T CountLoci
GATK Genome Processing
/usr/local/cluster/software/jre1.6-amd64/jre1.6.0_01/bin/java -jar GenomeAnalysisTK.jar -R hg18.fasta -I ../751351R.bam.sorted.bam -T CountReads
GATK Options
Some Available analyses:
VariantAnnotator: Annotates variant calls with context information.
DepthOfCoverage: Computes the depth of coverage at all loci in the specified region of the reference.
VariantFiltration: Filters variant calls using a number of user-selectable, parameterizable criteria.
UnifiedGenotyper: A variant caller which unifies the approaches of several disparate callers.
IndelGenotyperV2: This is a simple, counts-and-cutoffs based tool for calling indels from aligned sequencing data.
IndelRealigner: Performs local realignment of reads based on misalignments due to the presence of indels.
RealignerTargetCreator: Emits intervals for the Local Indel Realigner to target for cleaning.
CountLoci: Walks over the input data set, calculating the total # of covered loci for diagnostic purposes.
CountReads: Walks over the input data set, calculating the # of reads seen for diagnostic purposes.
ValidatingPileup: At every locus in the input set, compares the pileup data (reference base, aligned base
VariantEval: A robust and general purpose tool for characterizing the quality of SNPs, Indels, and other variants that includes basic counting, ti/tv, dbSNP% (if -D is provided), concordance to chip or validation data, and will show interesting sites (-V) that are FNs, FP, etc.
If sharing a cluster, be a good netizen
Use these tips to be a good portmanteau:
1. Check your disk usage (df -h)
2. Submit jobs to the queue, if available
3. Share genome indices in one place
4. Leave the camp site better than you found it. Clean up old files!
5. Never assume a backup is there; make your own
   1. LiveSync for synchronization
   2. Time Capsule for Backup