Transcript of Church_GenomeAccess_2013_genome2013
Deanna M. Church, Staff Scientist, NCBI
@deannachurch
Genome Sequencing and Assembly: The human reference assembly
http://genomereference.org
Valerie Schneider, NCBI
Why should you care about the Reference Assembly?
Genes, NCBI Homo sapiens Annotation Release 105
Transcript
CDS
dbSNP Build 138 using annotation release 104
http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes
Human assemblies available in the NCBI assembly database
http://www.ncbi.nlm.nih.gov/assembly
N50: Measure of continuity. Half of the assembled bases are in contigs of this length or greater.
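As a minimal sketch of the N50 definition, the function below sorts contig lengths from longest to shortest and returns the length at which the running total first reaches half of the assembled bases. The function name `n50` and the example lengths are illustrative assumptions, not data from the talk.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or scaffold) lengths."""
    if not lengths:
        raise ValueError("need at least one length")
    ordered = sorted(lengths, reverse=True)   # longest first
    half_total = sum(ordered) / 2             # half of all assembled bases
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            # Contigs of this length or longer hold at least half the bases.
            return length


# Hypothetical contig lengths (in kb), for illustration only.
contig_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(contig_lengths))
```

Note that N50 is weighted by bases, so it is generally not the median contig length: in the toy list above the median is 40, while two contigs alone already cover half of the 300 kb total.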
What is the Reference Assembly?
Biology
Repetitive sequence (interspersed repeats, segmental duplications)
Variation (regions of high diversity, structural variation)
Kidd et al., 2008
GRCh37 (Primary)
Technology
Read length: long reads vs. short reads
Mate lengths: distribution of insert sizes
Read accuracy: error model for your technology
Read depth: coverage at each base
Genome distribution: reads covering entire genome equally
Ajay et al., 2011
An assembly is a MODEL of the genome
Collins FS et al, 1998
Throughput: 500 Mb/year
Cost: < $0.25 per base
Variation: 100,000 SNPs mapped
February 2001
Genome Research, May, 1997
Contig: a sequence constructed from smaller, overlapping sequences, which contains no gaps.
Scaffold: a sequence constructed from smaller sequences, which may contain gaps.
Genome Vocabulary
Typically built from reads, but also from sequences in GenBank/EMBL/DDBJ
Typically built from sequences in GenBank/EMBL/DDBJ
Restrict and make libraries: 2, 4, 8, 10, 40, 150 kb
End-sequence all clones and retain pairing information (“mate-pairs”)
Find sequence overlaps
Each end sequence is referred to as a read
WGS contig
tails
WGS: Sanger Reads
Scaffold
A T T T T C C C T T C T G A A A T G A T G A A A G A G T C
BAC insert
BAC vector
Shotgun sequence
Assemble
GAPS
“finishers” go in to manually fill the gaps, often by PCR
Lander and Waterman (1988) Genomics
Assumptions:
Reads are randomly distributed
Overlap between reads does not vary
Variables:
G = haploid genome length in bp
L = sequence read length in bp
N = number of reads sequenced
T = amount of overlap needed for detection in bp
C = coverage (C = LN/G)
Poisson distribution: $P(Y = y) = \dfrac{\lambda^{y} e^{-\lambda}}{y!}$
y = number of events in an interval
$\lambda$ = mean number of events in an interval
For sequence calculations, coverage can be viewed as $\lambda$
Coverage    Not sequenced    Sequenced
1X          37%              63%
5X          0.6%             99.4%
10X         0.005%           99.995%
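The coverage figures follow from the Poisson model with the mean set to the coverage C: a base is left unsequenced when zero reads cover it, which has probability e^(-C). The sketch below works through that arithmetic. The function names are illustrative assumptions, and `expected_contigs` uses the standard Lander-Waterman result N·e^(-C(1 - T/L)), which is not stated on the slide.

```python
import math


def poisson_pmf(y, lam):
    """P(Y = y) for a Poisson distribution with mean lam."""
    return lam ** y * math.exp(-lam) / math.factorial(y)


def coverage(read_length, num_reads, genome_length):
    """C = LN/G, using the slide's variable definitions."""
    return read_length * num_reads / genome_length


def fraction_not_sequenced(c):
    """Probability that a given base is covered by zero reads at coverage c."""
    return poisson_pmf(0, c)


def expected_contigs(num_reads, c, read_length, min_overlap):
    """Expected number of contigs under Lander-Waterman (assumed, not on the slide)."""
    sigma = 1 - min_overlap / read_length   # usable fraction of each read
    return num_reads * math.exp(-c * sigma)


for c in (1, 5, 10):
    missed = fraction_not_sequenced(c)
    print(f"{c}X: {missed:.3%} not sequenced, {1 - missed:.3%} sequenced")
```

At 5X the computed unsequenced fraction is closer to 0.67% than the 0.6% on the slide, which appears to be truncated rather than rounded.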
2009 Sanger cost:
shotgun sequence ~ $0.01/base
finished sequence ~ $0.03/base
This clone: Shotgun = $1500, Finish = $3000
[Figure: Sequence Gaps: Uncaptured vs. Total. Bar chart of average gaps per BAC (axis 0 to 10) by species, with uncaptured gaps and captured gaps shown separately. Species: tetraodon, muntjak_indian, zebrafinch, zebrafish, macaque, alligator, chicken, sheep, monodelphis, orangutan, gorilla, vervet, cpbat, chimp, owl_monkey, cat, pig, dusky_titi, cow, elephant, fugu, baboon, dog, hedgehog, shrew, armadillo, opossum, squirrel_monkey, rabbit, galago, lemur, rfbat, rat, mouse, marmoset, wallaby, colobus_monkey, platypus]
Captured gap = no sequence, but a sub-clone spans the gap
Uncaptured gap = no sequence, no sub-clone spanning gap
Bob Blakesley, NISC
[Figure: Placing sequence contigs (labeled with marker letters A to O) against a non-sequence based map. Panels: "Ideally…" and "More like…". One contig is marked "(flip)" and one placement is marked "?".]
Sequence vs. Non-sequence based maps: Mmu7
WI Genetic
WI/MRC RH
[Figure: Human PANTHER classifications (biological process). Enrichment, observed vs. expected, by category. Categories: Major histocompatibility complex antigen, Chemokine, Tumor necrosis factor receptor, Other cytokine receptor, Cysteine protease inhibitor, CAM family adhesion molecule, Apolipoprotein, KRAB box transcription factor, Intermediate filament, Immunoglobulin receptor family member, Other cell adhesion molecule, Zinc finger transcription factor, Defense/immunity protein, Structural protein, Cysteine protease, Cytokine receptor, Oxygenase, Cell adhesion molecule, Transcription factor, Miscellaneous function, Signaling molecule, Oxidoreductase, Unclassified, Nucleic acid binding, Select regulatory molecule, Kinase, Hydrolase, Ribosomal protein, Protein kinase, G-protein modulator, Extracellular matrix, Other transcription factor]
Evan Eichler, University of Washington
Fragmented genomes tend to have more partial models
Fragmented genomes have fewer frameshifts
Alexander Souvorov, NCBI
http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/issue_detail.cgi?id=HG-1321
http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/issue_detail.cgi?id=HG-1012
RP11-34P13 64E8 RP4-669L17 RP5-857K21 RP11-206L10 RP11-54O7
Gaps
NCBI36 (hg18)
GRCh37 (hg19)
NCBI35 (hg17)
GRCh37 (hg19)
AL139246.20
AL139246.21
Build sequence contigs based on contigs defined in TPF (Tiling Path File).
Check for orientation consistencies
Select switch points
Instantiate sequence for further analysis
Switch point
Consensus sequence
NCBI36
nsv832911 (nstd68) Submitted on NCBI35 (hg17)
NCBI35 (hg17) Tiling Path
GRCh37 (hg19) Tiling Path
Gap Inserted
Moved approximately 2 Mb distal on chr15
NC_000015.8 (chr15)
NC_000015.9 (chr15)
Removed from assembly
Added to assembly
HG-24
Distributed data
Genome not in INSDC Database
Old Assembly Model
Human Genome Project (HGP)
5 July 2011
Issue tracking system (based on JIRA)
Full Dovetail
Half-dovetail
Contained
Short/Blunt
AGP: A Golden Path
Provides instructions for building a sequence
• Defines component sequences used to build scaffolds/chromosomes
• Switch points
• Defines gaps and types
GRC Produces
• AGP
• FASTA
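To make the AGP idea concrete, here is a minimal sketch of turning AGP rows into a sequence. It assumes the standard tab-separated AGP columns (object, object_beg, object_end, part_number, component_type, then either component_id, component_beg, component_end, orientation, or gap_length, gap_type, linkage, evidence for gap rows of type N or U). The object name, component IDs, and sequences are hypothetical, and this is not the GRC's own tooling.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Reverse-complement a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def build_from_agp(agp_text, components):
    """Assemble one object's sequence from AGP rows and component sequences."""
    parts = []
    for line in agp_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue                      # skip blank and comment lines
        cols = line.split("\t")
        component_type = cols[4]
        if component_type in ("N", "U"):  # gap row: column 6 is the gap length
            parts.append("N" * int(cols[5]))
            continue
        comp_id, beg, end, orient = cols[5], int(cols[6]), int(cols[7]), cols[8]
        piece = components[comp_id][beg - 1:end]   # AGP coordinates are 1-based
        if orient == "-":
            piece = reverse_complement(piece)
        parts.append(piece)
    return "".join(parts)


# Hypothetical components and AGP rows, for illustration only.
components = {"CLONE_A.1": "ACGTACGTAC", "CLONE_B.1": "TTTTGGGGCC"}
rows = [
    ["chrDemo", "1", "6", "1", "F", "CLONE_A.1", "1", "6", "+"],
    ["chrDemo", "7", "10", "2", "N", "4", "scaffold", "yes", "paired-ends"],
    ["chrDemo", "11", "15", "3", "F", "CLONE_B.1", "5", "9", "-"],
]
agp_text = "\n".join("\t".join(row) for row in rows)
print(build_from_agp(agp_text, components))
```

In this representation a switch point is implicit: it is where one component's component_end hands off to the next row's component_beg, so only the listed slice of each clone contributes to the built sequence.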
Distributed data
Genome not in INSDC Database
Old Assembly Model
Centralized Data
Sequences from haplotype 1
Sequences from haplotype 2
Old Assembly model: compress into a consensus
New Assembly model: represent both haplotypes
Assembly (e.g. GRCh37)
Primary Assembly
Non-nuclear assembly unit (e.g. MT)
ALT 1
ALT 2
ALT 3
ALT 4
ALT 5
ALT 6
ALT 7
ALT 8
ALT 9
PAR
Genomic Region (MHC)
Genomic Region (UGT2B17)
Genomic Region (MAPT)
AC074378.4 AC079749.5
AC134921.2 AC147055.2
AC140484.1 AC019173.4
AC093720.2 AC021146.7
NCBI36 NC_000004.10 (chr4) Tiling Path
Xue Y et al, 2008
TMPRSS11E TMPRSS11E2
GRCh37 NC_000004.11 (chr4) Tiling Path
AC074378.4 AC079749.5
AC134921.1 AC147055.2
AC093720.2 AC021146.7
TMPRSS11E
GRCh37: NT_167250.1 (UGT2B17 alternate locus)
AC074378.4 AC140484.1
AC019173.4 AC226496.2
AC021146.7
TMPRSS11E2
UGT2B17 Region
7 alternate haplotypes at the MHC
Alternate loci released as:
FASTA
AGP
Alignment to chromosome
UGT2B17 MHC MAPT
GRCh37 (hg19)
Oh No! Not a new version of the human reference!
Assembly (e.g. GRCh37.p13)
Primary Assembly
Non-nuclear assembly unit (e.g. MT)
ALT 1
ALT 2
ALT 3
ALT 4
ALT 5
ALT 6
ALT 7
ALT 8
ALT 9
PAR
…
Genomic Region (MHC)
Genomic Region (UGT2B17)
Genomic Region (MAPT)
Patches
Genomic Region (ABO)
Genomic Region (SMA)
Genomic Region (PECAM1)
MHC (chr6)
Chr 6 representation (PGF)
Alt_Ref_Locus_2 (COX)
17q deletion
H1
H2
Zody et al, 2008
chromosome
alt/patch
reads On-target alignment
Off-target alignments
(n=122,922)
Masks and alt-aware aligners reduce the incidence of ambiguous alignments observed when aligning reads to the full assembly
Mask1: mask chr for fix patches, scaffold for novel/alts
Mask2: mask only on scaffolds
http://www.ncbi.nlm.nih.gov/genome/assembly
Old Assembly Model: distributed data; genome not in INSDC Database
Updated Assembly Model: centralized data; genome in INSDC Database
Variant Calling and the Reference Assembly
http://www.bioplanet.com/gcat
Kidd et al, 2007 APOBEC cluster
Part of chr22 assembly
Alternate locus for chr22
White: Insertion
Black: Deletion
O'Rawe et al., 2013
Mouse Ren1 chr1 (CM000994.2/NC_000067.6): 133350674-133360320
NM_031192.3: transcript from C57BL/6J
NM_031193.2: transcript from FVB/N
129S6/SvEvTac Alt Locus Alignment Ren1 (allelic)
FVB/N Transcript Alignment Ren2 (paralog)
129S6/SvEvTac Ren1
FVB Ren2 Tx
Paralogous diff
SNP + Paralogous diff
Hydin: chr16 (16q22.2)
Hydin2: chr1 (1q21.1)
Missing in NCBI35/NCBI36
Unlocalized in GRCh37
Finished in GRCh38
Alignment to Hydin2 Genomic, 300 Kb, 99.4% ID (Paralogous)
Alignment to Hydin1 CHM1_1.0, >99.9% ID (Allelic)
Doggett et al., 2006
http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes
CDC27
1KG Phase 1 Strict accessibility mask
SNP (all)
SNP (not 1KG)
http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes
Sudmant et al., 2010
GRCh38 is coming (September, 2013)
GRCh37 Scaffold N50: 44,983,201
GRCh37B Scaffold N50: 62,124,159
GRCh37 Contig N50: 38,440,852
GRCh37B Contig N50: 49,319,739
Major Features of GRCh38
Modeled Centromeres
Individual base updates
Fixed tiling path/assembly errors
Addition of novel sequence
Adding Novel Sequence
Karen Miga and Jim Kent arXiv:1307.0035
Dennis et al., 2012
1q32 1q21 1p21
1p21 patch alignment to chromosome 1
Ref allele frequency = 0
Mismatches, MAF = 0: n=15,244 (61-mer analysis set: 9664; 1kG high-confidence set: 1358; overlap: 4222)
Insertions, MAF = 0: n=834
Deletions, MAF = 0: n=1541
Mismatch in pseudo/pr txpt, MAF < 5%: n=1413
Annotator and clinical requests: n= ~260
Pile-Up Analysis: “Never Seen” Mismatched Bases Originating from RP11 Components
79% of these bases are heterozygous in RP11 WGS
GRCh37 Insertions Originating from RP11
GRCh37 Deletions Originating from RP11
17% heterozygous in RP11 WGS
18% heterozygous in RP11 WGS
Fixing Rare/Incorrect Bases
NOVEL GENES!
GRCh37.p13: 211 genes found only on alt loci and patches
Genovese et al., 2013
FAM23_MRC1 Region, chr10
Segmental Duplications
1KG accessibility Mask
Novel Patch 250 kb of artificial duplication
Adding Novel Sequence
GRCh37.p13
120 Fix Patches
60 Novel
Human Resolved for GRCh38
Remap Set up slide
GRCh38 is coming (September, 2013)