Church_GenomeAccess_2013_genome2013

109
Deanna M. Church Staff Scientist, NCBI @deannachurch Genome Sequencing and Assembly The human reference assembly

description

Sequencing and assembly lecture for the CSHL genome access course, Nov 2013

Transcript of Church_GenomeAccess_2013_genome2013

Page 1: Church_GenomeAccess_2013_genome2013

Deanna M. Church Staff Scientist, NCBI

@deannachurch

Genome Sequencing and Assembly The human reference assembly

Page 2: Church_GenomeAccess_2013_genome2013

http://genomereference.org

Valerie Schneider, NCBI

Page 3: Church_GenomeAccess_2013_genome2013

Why should you care about the Reference Assembly?

Page 4: Church_GenomeAccess_2013_genome2013

Genes, NCBI Homo sapiens Annotation Release 105

Transcript

CDS

dbSNP Build 138 using annotation release 104

Page 5: Church_GenomeAccess_2013_genome2013

http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes

Page 6: Church_GenomeAccess_2013_genome2013

Human assemblies available in the NCBI assembly database

http://www.ncbi.nlm.nih.gov/assembly

Page 7: Church_GenomeAccess_2013_genome2013
Page 8: Church_GenomeAccess_2013_genome2013

N50:Measure of continuity.Half of the contigs in the assembly are this length or greater.

Page 9: Church_GenomeAccess_2013_genome2013

What is the Reference Assembly?

Page 10: Church_GenomeAccess_2013_genome2013
Page 11: Church_GenomeAccess_2013_genome2013
Page 12: Church_GenomeAccess_2013_genome2013

BiologyRepetitive sequence (interspersed repeats, segmental duplications)Variation

(regions of high diversity, structural variation)

Kidd et al., 2008

Page 13: Church_GenomeAccess_2013_genome2013

GRCh37 (Primary)

Page 14: Church_GenomeAccess_2013_genome2013

TechnologyRead length long reads vs. short readsMate lengths distribution of insert sizesRead accuracy error model for your technologyRead depth coverage at each baseGenome distribution reads covering entire genome equally

Ajay et al., 2011

Page 15: Church_GenomeAccess_2013_genome2013

An assembly is a MODEL of the genome

Page 16: Church_GenomeAccess_2013_genome2013
Page 17: Church_GenomeAccess_2013_genome2013

Collins FS et al, 1998

Throughput: 500 Mb/yearCost: < $0.25 per base

Variation: 100,000 SNPs mapped

Page 18: Church_GenomeAccess_2013_genome2013

February 2001

Page 19: Church_GenomeAccess_2013_genome2013

Genome Research, May, 1997

Page 20: Church_GenomeAccess_2013_genome2013

Contig: a sequence constructed from smaller, overlapping sequences, which contains no gaps.

Scaffold: a sequence constructed from smaller sequences, which may contain gaps.

Genome Vocabulary

Typically built from reads, but also from sequences in GenBank/EMBL/DDBJ

Typically built from sequences in GenBank/EMBL/DDBJ

Page 21: Church_GenomeAccess_2013_genome2013

Restrict and make libraries2, 4, 8, 10, 40, 150 kb

End-sequence allclones and retainpairing information“mate-pairs”

Find sequence overlaps

Each end sequenceis referred to as a read

WGS contig

tails

WGS: Sanger Reads

Scaffold

Page 22: Church_GenomeAccess_2013_genome2013
Page 23: Church_GenomeAccess_2013_genome2013

A T T T T C C C T T C T G A A A T G A T G A A A G A G T C

Page 24: Church_GenomeAccess_2013_genome2013

BAC insertBAC vector

Shotgun sequence

Assemble

GAPS

“finishers” go in to manually fill the gaps, often by PCR

Page 25: Church_GenomeAccess_2013_genome2013
Page 26: Church_GenomeAccess_2013_genome2013

Lander and Waterman(1988) Genomics

Reads are randomly distributedOverlap between reads does not vary

AssumptionsVariables:G= haploid genome length in bpL= sequence read length in bpN= number of reads sequencedT= amount of overlap needed for detection in bpC= Coverage (C=LN/G)

Poisson distribution:P(Y=y)=(ly * e–l)/y!

y= number of events in an interval

l = mean number of events in an interval

For sequence calculations, coverage can be viewed as l

Page 27: Church_GenomeAccess_2013_genome2013

SequencedNot sequenced

1X Coverage5X Coverage

10X Coverage

37% 63%0.6% 99.4%

0.005% 99.995%

Page 28: Church_GenomeAccess_2013_genome2013

2009 Sanger cost: shotgun sequence ~ $0.01/base finished sequence ~ $0.03/base

This clone: Shotgun=$1500Finish=$3000

Page 29: Church_GenomeAccess_2013_genome2013
Page 30: Church_GenomeAccess_2013_genome2013

tetra

odon

mun

tjak_

indian

zebr

afinc

h

zebr

afish

mac

aque

alliga

tor

chick

en

shee

p

mon

odelp

his

oran

gutan

goril

la

verv

et

cpba

t

chim

p

owl_m

onke

y cat

pig

dusk

y_titi co

w

eleph

ant

fugu

babo

on dog

hedg

ehog

shre

w

arm

adillo

opos

sum

squir

rel_m

onke

yra

bbit

galag

o

lemur

rfbat ra

t

mou

se

mar

mos

et

wallab

y

colob

us_m

onke

y

platyp

us

0

1

2

3

4

5

6

7

8

9

10

Sequence Gaps : Uncaptured vs. Total

Uncaptured gaps Captured gaps

Species

Gap

Ave

. per

BA

C

Captured gap= no sequence, but a sub-clone spans the gapUncaptured gap= no sequence, no sub-clone spanning gap

Bob Blakesley, NISC

Page 31: Church_GenomeAccess_2013_genome2013

A

BCD

EFGH

IJKLMNO

ABCD

FGH

KL

ON

Ideally…

Non-sequence based Map

(flip)

ABCD

FGH

KL

ON

Page 32: Church_GenomeAccess_2013_genome2013

More like…

A

BCD

EFGH

IJKLMNO

A

BC

ZYX

W

H

J

M

V

N

O

AB

HIJ

CDY

LMNO

AB

HIJ

LMNO

?

Page 33: Church_GenomeAccess_2013_genome2013

Sequence vs. Non-sequence based mapsMmu7

WI GeneticWI/MRC RH

Page 34: Church_GenomeAccess_2013_genome2013
Page 35: Church_GenomeAccess_2013_genome2013

EnrichmentObservedExpected

-5

-4

-3

-2

-1

0

1

2

3

4

5

60

40

20

0

20

40

60

Maj

or h

isto

com

patib

ility

com

plex

ant

igen

Che

mok

ine

Tum

or n

ecro

sis

fact

or r

ecep

tor

Oth

er c

ytok

ine

rece

ptor

Cys

tein

e pr

otea

se in

hibi

tor

CA

M fa

mily

adh

esio

n m

olec

ule

Apo

lipop

rote

in

KR

AB

box

tran

scrip

tion

fact

or

Inte

rmed

iate

fila

men

t

Imm

unog

lobu

lin r

ecep

tor

fam

ily m

embe

r

Oth

er c

ell a

dhes

ion

mol

ecul

e

Zin

c fin

ger

tran

scrip

tion

fact

or

Def

ense

/imm

unity

pro

tein

Str

uctu

ral p

rote

in

Cys

tein

e pr

otea

se

Cyt

okin

e re

cept

or

Oxy

gena

se

Cel

l adh

esio

n m

olec

ule

Tra

nscr

iptio

n fa

ctor

Mis

cella

neou

s fu

nctio

n

Sig

nalin

g m

olec

ule

Oxi

dore

duct

ase

Unc

lass

ified

Nuc

leic

aci

d bi

ndin

g

Sel

ect r

egul

ator

y m

olec

ule

Kin

ase

Hyd

rola

se

Rib

osom

al p

rote

in

Pro

tein

kin

ase

G-p

rote

in m

odul

ator

Ext

race

llula

r m

atrix

Oth

er tr

ansc

riptio

n fa

ctor

Human- panther classifications (biological process)

Evan Eichler, University of Washington

Page 36: Church_GenomeAccess_2013_genome2013
Page 37: Church_GenomeAccess_2013_genome2013

Fragmented genomes tend to have more partial models

Fragmented genomes have fewer frameshifts

Alexander Souvorov, NCBI

Page 38: Church_GenomeAccess_2013_genome2013
Page 39: Church_GenomeAccess_2013_genome2013

http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/issue_detail.cgi?id=HG-1321

Page 40: Church_GenomeAccess_2013_genome2013

http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/issue_detail.cgi?id=HG-1012

Page 41: Church_GenomeAccess_2013_genome2013

RP11-34P13 64E8 RP4-669L17 RP5-857K21 RP11-206L10 RP11-54O7

Gaps

Page 42: Church_GenomeAccess_2013_genome2013

NCBI36 (hg18)

GRC

h37

(hg1

9)

Page 43: Church_GenomeAccess_2013_genome2013

NCBI35 (hg17)

GRCh37 (hg19)

AL139246.20

AL139246.21

Page 44: Church_GenomeAccess_2013_genome2013

Build sequence contigs based on contigs defined in TPF (Tiling Path File).

Check for orientation consistenciesSelect switch pointsInstantiate sequence for further analysis

Switch point

Consensus sequence

Page 45: Church_GenomeAccess_2013_genome2013

NCBI36

Page 46: Church_GenomeAccess_2013_genome2013

nsv832911 (nstd68) Submitted on NCBI35 (hg17)

Page 47: Church_GenomeAccess_2013_genome2013

NCBI35 (hg17) Tiling Path

GRCh37 (hg19) Tiling Path

Gap Inserted

Moved approximately 2 Mb distal on chr15

NC_0000015.8 (chr15)

NC_0000015.9 (chr15)

Removed from assembly

Added to assembly

HG-24

Page 48: Church_GenomeAccess_2013_genome2013

http://genomereference.org

Page 49: Church_GenomeAccess_2013_genome2013

http://genomereference.org

Page 50: Church_GenomeAccess_2013_genome2013

Distributed data

Genome not in INSDC Database

Old Assembly Model

Human Genome Project (HGP)

Page 51: Church_GenomeAccess_2013_genome2013
Page 52: Church_GenomeAccess_2013_genome2013
Page 53: Church_GenomeAccess_2013_genome2013
Page 54: Church_GenomeAccess_2013_genome2013

5 July 2011

Page 55: Church_GenomeAccess_2013_genome2013

Issue tracking system (based on JIRA)

http://genomereference.org

Page 56: Church_GenomeAccess_2013_genome2013
Page 57: Church_GenomeAccess_2013_genome2013

Full Dovetail

Half-dovetail

Contained

Short/Blunt

Page 58: Church_GenomeAccess_2013_genome2013
Page 59: Church_GenomeAccess_2013_genome2013
Page 60: Church_GenomeAccess_2013_genome2013
Page 61: Church_GenomeAccess_2013_genome2013
Page 62: Church_GenomeAccess_2013_genome2013

AGP: A Golden Path

Provides instructions for building a sequence• Defines components sequences used to build scaffolds/chromosome• Switch points• Defines gaps and types

GRC Produces• AGP• FASTA

Page 63: Church_GenomeAccess_2013_genome2013

Distributed data

Genome not in INSDC Database

Old Assembly Model

Centralized Data

Page 64: Church_GenomeAccess_2013_genome2013

Sequences from haplotype 1Sequences from haplotype 2

Old Assembly model: compress into a consensus

New Assembly model: represent both haplotypes

Page 65: Church_GenomeAccess_2013_genome2013

Assembly (e.g. GRCh37)

Primary Assembly

Non-nuclear assembly unit

(e.g. MT)

ALT 1

ALT 2

ALT 3

ALT 4

ALT 5

ALT 9

ALT 6

ALT 7ALT

8

PAR

Genomic Region(MHC)

Genomic Region

(UGT2B17)Genomic

Region(MAPT)

Page 66: Church_GenomeAccess_2013_genome2013

AC074378.4AC079749.5

AC134921.2AC147055.2

AC140484.1AC019173.4

AC093720.2AC021146.7

NCBI36 NC_000004.10 (chr4) Tiling Path

Xue Y et al, 2008

TMPRSS11E TMPRSS11E2

GRCh37 NC_000004.11 (chr4) Tiling Path

AC074378.4AC079749.5

AC134921.1AC147055.2

AC093720.2AC021146.7

TMPRSS11E

GRCh37: NT_167250.1 (UGT2B17 alternate locus)

AC074378.4AC140484.1

AC019173.4AC226496.2

AC021146.7

TMPRSS11E2

UGT2B17 Region

Page 67: Church_GenomeAccess_2013_genome2013

7 alternate haplotypesat the MHC

Alternate loci released as:FASTA

AGPAlignment to chromosome

UGT2B17 MHC MAPT

GRCh37 (hg19)

Page 68: Church_GenomeAccess_2013_genome2013

Oh No! Not a new version of the human reference!

http://genomereference.org

Page 69: Church_GenomeAccess_2013_genome2013
Page 70: Church_GenomeAccess_2013_genome2013

Assembly (e.g. GRCh37.p13)

Primary Assembly

Non-nuclear assembly unit

(e.g. MT)

ALT 1

ALT 2

ALT 3

ALT 4

ALT 5

ALT 9

ALT 6

ALT 7ALT

8

PAR

Genomic Region(MHC)

Genomic Region

(UGT2B17)Genomic

Region(MAPT)

Patches

Genomic Region(ABO)

Genomic Region(SMA)

Genomic Region

(PECAM1)

Page 71: Church_GenomeAccess_2013_genome2013

MHC (chr6)Chr 6 representation (PGF)

Alt_Ref_Locus_2 (COX)

Page 72: Church_GenomeAccess_2013_genome2013

17q deletion

H1

H2

Zody et al, 2008

Page 73: Church_GenomeAccess_2013_genome2013
Page 74: Church_GenomeAccess_2013_genome2013

chromosome

alt/patch

reads On-target alignment

Off-target alignments

(n=122,922)

Page 75: Church_GenomeAccess_2013_genome2013
Page 76: Church_GenomeAccess_2013_genome2013

Masks and alt aware aligners reduce the incidence of ambiguous alignments observed when aligning

reads to the full assembly

Mask1: mask chr for fix patches, scaffold for novel/alts. Mask2: mask only on scaffolds

Page 77: Church_GenomeAccess_2013_genome2013

Distributed data

Old Assembly Model

Centralized Data

Updated Assembly Model

Genome not in INSDC Database

Page 78: Church_GenomeAccess_2013_genome2013
Page 79: Church_GenomeAccess_2013_genome2013

http://www.ncbi.nlm.nih.gov/genome/assembly

Page 80: Church_GenomeAccess_2013_genome2013
Page 81: Church_GenomeAccess_2013_genome2013

Distributed data

Genome not in INSDC Database

Old Assembly Model

Centralized Data

Updated Assembly Model

Genome in INSDC Database

Genome not in INSDC Database

Page 82: Church_GenomeAccess_2013_genome2013

Variant Calling and the Reference Assembly

Page 83: Church_GenomeAccess_2013_genome2013

http://www.bioplanet.com/gcat

Page 84: Church_GenomeAccess_2013_genome2013

Kidd et al, 2007 APOBEC cluster

Part of chr22 assembly

Alternate locus for chr22

White: InsertionBlack: Deletion

Page 85: Church_GenomeAccess_2013_genome2013

Rawe et al, 2013

Page 86: Church_GenomeAccess_2013_genome2013

Mouse Ren1 chr1 (CM000994.2/NC_000067.6): 133350674-133360320

NM_031192.3: transcript from C57BL/6JNM_031193.2: transcript from FVB/N

129S6/SvEvTac Alt Locus Alignment Ren1 (allelic)

FVB/N Transcript Alignment Ren2 (paralog)

Page 87: Church_GenomeAccess_2013_genome2013

129S6/SvEvTac Ren1

FVB Ren2 Tx

Paralogousdiff

SNP +Paralogous

diff

Mouse Ren1 chr1 (CM000994.2/NC_000067.6): 133350674-133360320

NM_031192.3: transcript from C57BL/6JNM_031193.2: transcript from FVB/N

Page 88: Church_GenomeAccess_2013_genome2013

Hydin: chr16 (16q22.2)Hydin2: chr1 (1q21.1)Missing in NCBI35/NCBI36 Unlocalized in GRCh37 Finished in GRCh38

Alignment to Hydin2 Genomic, 300 Kb, 99.4% ID

Alignment to Hydin1 CHM1_1.0, >99.9% ID

(Paralogous)

(Allelic)Alignment to Hydin2 Genomic, 300 Kb, 99.4% ID

Alignment to Hydin1 CHM1_1.0, >99.9% ID

Doggett et al., 2006

Page 89: Church_GenomeAccess_2013_genome2013

http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes

CDC27

1KG Phase 1 Strict accessibility mask

SNP (all)

SNP (not 1KG)

Page 90: Church_GenomeAccess_2013_genome2013

http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes

Page 91: Church_GenomeAccess_2013_genome2013

Sudmant et al., 2010

Page 92: Church_GenomeAccess_2013_genome2013

GRCh38 is coming(September, 2013)

Page 93: Church_GenomeAccess_2013_genome2013

GRCh37 Scaff N50: 44,983,201GRCh37B Scaff N50: 62,124,159

GRCh37 Contig N50: 38,440,852GRCh37B Contig N50: 49,319,739

Page 94: Church_GenomeAccess_2013_genome2013
Page 95: Church_GenomeAccess_2013_genome2013
Page 96: Church_GenomeAccess_2013_genome2013

Modeled CentromeresIndividual base updatesFixed tiling path/assembly errorsAddition of novel sequence

Major Features of GRCh38

Page 97: Church_GenomeAccess_2013_genome2013

Adding Novel Sequence

Karen Miga and Jim Kent arXiv:1307.0035

Page 98: Church_GenomeAccess_2013_genome2013

Dennis et al., 2012

1q32 1q21 1p21

1p21 patch alignment to chromosome 1

Page 99: Church_GenomeAccess_2013_genome2013

61-mer analysis

set9664

1kG high-confidence

set1358

4222

Ref allele frequency = 0Mismatches MAF = 0

n=15,244

MAF=0Insertio

nsn=834

MAF=0Deletion

sn=1541

MAF<5%Mismatc

h in pseudo/pr txptn=1413

Annotator and clinical

requestsn= ~260

Page 100: Church_GenomeAccess_2013_genome2013

Pile-Up Analysis: “Never Seen” Mismatched Bases Originating from RP11 Components

79% of these bases are heterozygous in RP11 WGS

Page 101: Church_GenomeAccess_2013_genome2013

GRCh37 Insertions Originating from RP11

GRCh37 Deletions Originating from RP11

17% heterozygous in RP11 WGS

18% heterozygous in RP11 WGS

Page 102: Church_GenomeAccess_2013_genome2013

Fixing Rare/Incorrect Bases

Page 103: Church_GenomeAccess_2013_genome2013

NOVEL GENES!

GRCh37.p13: 211 genes found only on alt loci and patches

Page 104: Church_GenomeAccess_2013_genome2013

Genovese et al., 2013

Page 105: Church_GenomeAccess_2013_genome2013

FAM23_MRC1 Region, chr10

Segmental Duplications

1KG accessibility Mask

Novel Patch 250 kb of artificial duplication

Page 106: Church_GenomeAccess_2013_genome2013

Adding Novel Sequence

Page 107: Church_GenomeAccess_2013_genome2013

GRCh37p13120 Fix Patches60 Novel

Human Resolved for GRCh38

http://genomereference.org

Page 108: Church_GenomeAccess_2013_genome2013

Remap Set up slide

Page 109: Church_GenomeAccess_2013_genome2013

GRCh38 is coming(September, 2013)