GRC Workshop

Post on 23-Mar-2016

GRC Workshop, ASHG

22 Oct 2013

Outline
• Reference Assembly Basics
• GRC: Assembly management and dataflow
• GRCh38
• Accessing the assembly and data

http://genomereference.org

What is the Reference Assembly?

Reference Assembly Basics

An assembly is a MODEL of the genome

Lander and Waterman (1988) Genomics

Assumptions:
• Reads are randomly distributed
• Overlap between reads does not vary

Variables:
G = haploid genome length in bp
L = sequence read length in bp
N = number of reads sequenced
T = amount of overlap needed for detection in bp
C = coverage (C = LN/G)

Poisson distribution:
$P(Y = y) = \dfrac{\lambda^{y} e^{-\lambda}}{y!}$
y = number of events in an interval
$\lambda$ = mean number of events in an interval

For sequence calculations, coverage can be viewed as $\lambda$.

Reference Assembly Basics

Using this equation, you can calculate the probability that a base has been sequenced y times.

By manipulating this formula, you can estimate the number of gaps for any given level of coverage.

Coverage    Not sequenced    Sequenced
1X          37%              63%
5X          0.6%             99.4%
10X         0.005%           99.995%
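The percentages above follow from the Poisson formula with y = 0. The sketch below recomputes them and also estimates the number of contigs (and so, roughly, the number of gaps) for a given coverage. The function names are illustrative, and the contig estimate N·e^(−C(1−T/L)) is the Lander and Waterman (1988) result, which the slide does not spell out.

```python
import math

def coverage(read_length_bp, n_reads, genome_length_bp):
    """C = LN/G, as defined in the variables list."""
    return read_length_bp * n_reads / genome_length_bp

def poisson_pmf(y, lam):
    """P(Y = y) for a Poisson distribution with mean lam."""
    return lam ** y * math.exp(-lam) / math.factorial(y)

def fraction_not_sequenced(c):
    """Probability that a base is covered by zero reads (y = 0)."""
    return poisson_pmf(0, c)

def expected_contigs(n_reads, c, overlap_bp, read_length_bp):
    """Assumption: Lander-Waterman estimate N * exp(-C * (1 - T/L))."""
    sigma = 1 - overlap_bp / read_length_bp
    return n_reads * math.exp(-c * sigma)

for c in (1, 5, 10):
    missed = fraction_not_sequenced(c)
    print(f"{c}X: not sequenced {missed:.3%}, sequenced {1 - missed:.3%}")
```

The read length, read count, and overlap passed to these functions are inputs the reader must supply; none of them come from the slides.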

Reference Assembly Basics

2009 Sanger cost:
• shotgun sequence: ~$0.01/base
• finished sequence: ~$0.03/base

This clone: Shotgun = $1500, Finish = $3000

Reference Assembly Basics

[Figure: Sequence Gaps: Uncaptured vs. Total. Bar chart of average gaps per BAC (scale 0 to 10) by species, split into uncaptured and captured gaps. Species: tetraodon, muntjak_indian, zebra finch, zebrafish, macaque, alligator, chicken, sheep, monodelphis, orangutan, gorilla, vervet, cpbat, chimp, owl_monkey, cat, pig, dusky_titi, cow, elephant, fugu, baboon, dog, hedgehog, shrew, armadillo, opossum, squirrel_monkey, rabbit, galago, lemur, rfbat, rat, mouse, marmoset, wallaby, colobus_monkey, platypus.]

Captured gap = no sequence, but a sub-clone spans the gap
Uncaptured gap = no sequence, no sub-clone spanning gap

Bob Blakesley, NISC

Reference Assembly Basics

Biology
• Repetitive sequence (interspersed repeats, segmental duplications)
• Variation (regions of high diversity, structural variation)

Kidd et al., 2008

Reference Assembly Basics

Eugene Yaschenko, NCBI

[Figure: Human PANTHER classifications (biological process). Enrichment, observed vs. expected, by category: Major histocompatibility complex antigen; Chemokine; Tumor necrosis factor receptor; Other cytokine receptor; Cysteine protease inhibitor; CAM family adhesion molecule; Apolipoprotein; KRAB box transcription factor; Intermediate filament; Immunoglobulin receptor family member; Other cell adhesion molecule; Zinc finger transcription factor; Defense/immunity protein; Structural protein; Cysteine protease; Cytokine receptor; Oxygenase; Cell adhesion molecule; Transcription factor; Miscellaneous function; Signaling molecule; Oxidoreductase; Unclassified; Nucleic acid binding; Select regulatory molecule; Kinase; Hydrolase; Ribosomal protein; Protein kinase; G-protein modulator; Extracellular matrix; Other transcription factor.]

Evan Eichler, University of Washington

Reference Assembly Basics

Technology
• Read length: long reads vs. short reads
• Mate lengths: distribution of insert sizes
• Read accuracy: error model for your technology
• Read depth: coverage at each base
• Genome distribution: reads covering entire genome equally

Ajay et al., 2011

Genome Research, May, 1997

Reference Assembly Basics

WGS: Sanger Reads
• Restrict and make libraries: 2, 4, 8, 10, 40, 150 kb
• End-sequence all clones and retain pairing information (“mate-pairs”)
• Each end sequence is referred to as a read
• Find sequence overlaps

[Figure labels: WGS contig; tails; Scaffold]

Reference Assembly Basics

Genome Vocabulary

Contig: a sequence constructed from smaller, overlapping sequences, which contains no gaps. Typically built from reads, but also from sequences in GenBank/EMBL/DDBJ.

Scaffold: a sequence constructed from smaller sequences, which may contain gaps. Typically built from sequences in GenBank/EMBL/DDBJ.
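To make the contig/scaffold distinction concrete, here is a minimal sketch that recovers the gap-free contigs from a scaffold string. It assumes gaps are written as runs of N, as is conventional in FASTA files; the toy scaffold and the function name are hypothetical.

```python
import re

def split_scaffold(scaffold_seq):
    """Return (start, end, sequence) for each gap-free stretch of a scaffold.

    Assumption: gaps are represented as runs of N (or n).
    Coordinates are 0-based, end-exclusive.
    """
    return [
        (match.start(), match.end(), match.group())
        for match in re.finditer(r"[^Nn]+", scaffold_seq)
    ]

# Toy scaffold: three contigs separated by two gaps.
scaffold = "ACGTACGT" + "N" * 10 + "GGCCTTAA" + "N" * 5 + "TTTT"
contigs = split_scaffold(scaffold)
for start, end, seq in contigs:
    print(start, end, seq)
```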

Reference Assembly Basics

Schatz et al, 2010

Reference Assembly Basics

A T T T T C C C T T C T G A A A T G A T G A A A G A G T C

Reference Assembly Basics

Clone based assemblies

[Figure: BAC insert and BAC vector; shotgun sequence; assemble. Plot labels: Fold sequence; Gaps]

• Deeper sequence coverage rarely resolves all gaps
• “Finishers” go in to manually fill the gaps, often by PCR

Reference Assembly Basics

Ideally…

[Figure: contigs A, BCD, EFGH, IJKLMNO compared with a Non-sequence based Map (ABCD, FGH, KL, ON); one step is labeled (flip)]

Reference Assembly Basics

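More like…

[Figure: contigs A, BCD, EFGH, IJKLMNO compared with less consistent data (labels include BC, ZYX, W, HJ, M, V, N, O, AB, HIJ, CDY, LMNO), leaving a placement marked “?”]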

Reference Assembly Basics

Sequence vs. Non-sequence based maps: Mmu7

WI Genetic, WI/MRC RH

Human assemblies available in the NCBI assembly database

http://www.ncbi.nlm.nih.gov/assembly

Reference Assembly Basics

N50: Measure of continuity. Contigs of this length or greater contain half of the bases in the assembly.
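A short sketch of how N50 is commonly computed: sort the contig lengths from longest to shortest and walk down the list until the running total reaches half of all assembled bases. The contig lengths in the example are made up for illustration.

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length

# Hypothetical contig lengths (bp); the total is 290, so half is 145.
example = [80, 70, 50, 40, 30, 20]
print(n50(example))
```

Note that this weights contigs by their length, so it is not the same as the median contig length.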

Reference Assembly Basics

• Fragmented genomes tend to have more partial models
• Fragmented genomes have fewer frameshifts

Alexander Souvorov, NCBI

Outline
• Reference Assembly Basics
• GRC: Assembly management and dataflow
• GRCh38
• Accessing the assembly and data

http://genomereference.org

GRC Assembly Management

Human Genome Project (HGP)

[Figure: assembly model diagram. Labels: Old Assembly Model; Distributed data; Genome not in INSDC Database; Centralized Data]

Issue tracking system (based on JIRA)

5 July 2011

GRC Assembly Management

Tiling Path File (TPF)

ACCESSION   NAME          CONTIG
GAP         Telomere      10000
AP006221    XX-190A2      Hschr1_ctg1
AL627309    RP11-34P13    Hschr1_ctg1
GAP         type-3
AC114498    RP5-857K21    Hschr1_ctg3
AL669831    RP11-206L10   Hschr1_ctg3
AL645608    RP11-54O7     Hschr1_ctg3
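As an illustration of how the tiling path rows group clones into contigs, the sketch below parses the rows shown on the slide. It handles only this simplified three-column layout; real TPF files carry additional header lines and columns that are not covered here, and the function name is hypothetical.

```python
# Rows copied from the slide (whitespace separated).
TPF_TEXT = """\
GAP Telomere 10000
AP006221 XX-190A2 Hschr1_ctg1
AL627309 RP11-34P13 Hschr1_ctg1
GAP type-3
AC114498 RP5-857K21 Hschr1_ctg3
AL669831 RP11-206L10 Hschr1_ctg3
AL645608 RP11-54O7 Hschr1_ctg3
"""

def parse_tpf(text):
    """Group clone rows by contig and collect gap rows, keeping file order."""
    contigs = {}  # contig name -> list of (accession, clone name)
    gaps = []     # remaining fields of each GAP row
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "GAP":
            gaps.append(tuple(fields[1:]))
        else:
            accession, clone_name, contig = fields
            contigs.setdefault(contig, []).append((accession, clone_name))
    return contigs, gaps

contigs, gaps = parse_tpf(TPF_TEXT)
for name, clones in contigs.items():
    print(name, [accession for accession, _ in clones])
```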

GRC Assembly Management

Full Dovetail

Half-dovetail

Contained

Short/Blunt

GRC Assembly Management

Build sequence contigs based on contigs defined in the TPF (Tiling Path File):
• Check for orientation consistencies
• Select switch points
• Instantiate sequence for further analysis

[Figure labels: Switch point; Representative chromosome sequence]

GRC Assembly Management

AGP: A Golden Path

Provides instructions for building a sequence
• Defines component sequences used to build scaffolds/chromosomes
• Switch points
• Defines gaps and types

GRC Produces
• AGP
• FASTA
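A minimal sketch of how an AGP file acts as build instructions: each tab-separated row either names a slice of a component (the slice boundaries are the switch points) or a gap of a given length. The column positions follow the standard AGP layout; the object name, component IDs, and sequences below are toy values invented for the example.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def build_from_agp(agp_lines, components):
    """Concatenate component slices and gaps in the order given by AGP rows."""
    pieces = []
    for line in agp_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.rstrip("\n").split("\t")
        if cols[4] in ("N", "U"):
            # Gap row: column 6 holds the gap length.
            pieces.append("N" * int(cols[5]))
        else:
            # Component row: id, 1-based inclusive begin/end, orientation.
            comp_id, beg, end, orient = cols[5], int(cols[6]), int(cols[7]), cols[8]
            piece = components[comp_id][beg - 1:end]
            if orient == "-":
                piece = reverse_complement(piece)
            pieces.append(piece)
    return "".join(pieces)

# Hypothetical components and AGP rows.
components = {"cloneA.1": "AAAACCCC", "cloneB.1": "GGGGTTTTAA"}
agp = [
    "chrToy\t1\t6\t1\tF\tcloneA.1\t1\t6\t+",
    "chrToy\t7\t10\t2\tN\t4\tscaffold\tyes\tpaired-ends",
    "chrToy\t11\t16\t3\tF\tcloneB.1\t3\t8\t-",
]
sequence = build_from_agp(agp, components)
print(sequence)
```

The sketch treats only "-" as a reverse orientation and does not validate that object coordinates match the assembled length; a production tool would check both.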

GRC Assembly Management

[Figure: assembly model diagram. Labels: Old Assembly Model; Updated Assembly Model; Distributed data; Centralized Data; Genome not in INSDC Database]

Sequences from haplotype 1
Sequences from haplotype 2

Old Assembly model: compress into a consensus

New Assembly model: represent both haplotypes

GRC Assembly Management

[Figure: structure of an assembly (e.g. GRCh37). Units: Primary Assembly; Non-nuclear assembly unit (e.g. MT); alternate loci units ALT 1 to ALT 9; PAR. Genomic regions: MHC, UGT2B17, MAPT.]

GRC Assembly Management

UGT2B17 Region

NCBI36 NC_000004.10 (chr4) Tiling Path
Components: AC074378.4, AC079749.5, AC134921.2, AC147055.2, AC140484.1, AC019173.4, AC093720.2, AC021146.7
Genes: TMPRSS11E, TMPRSS11E2

Xue Y et al, 2008

GRCh37 NC_000004.11 (chr4) Tiling Path
Components: AC074378.4, AC079749.5, AC134921.1, AC147055.2, AC093720.2, AC021146.7
Gene: TMPRSS11E

GRCh37: NT_167250.1 (UGT2B17 alternate locus)
Components: AC074378.4, AC140484.1, AC019173.4, AC226496.2, AC021146.7
Gene: TMPRSS11E2

GRCh37 (hg19)

UGT2B17, MHC, MAPT
7 alternate haplotypes at the MHC

Alternate loci released as:
• FASTA
• AGP
• Alignment to chromosome

[Figure: structure of an assembly with patches (e.g. GRCh37.p13). Units: Primary Assembly; Non-nuclear assembly unit (e.g. MT); alternate loci units ALT 1 to ALT 9; PAR; Patches. Genomic regions: MHC, UGT2B17, MAPT, ABO, SMA, PECAM1.]

GRC Assembly Management

GRCh37.p13
• 178 Regions: 3.15% of chromosome sequence
• 131 FIX patches: add 6.8 Mb novel sequence
• 73 NOVEL patches: add >800kb novel sequence

MHC (chr6)

Chr 6 representation (PGF)

Alt_Ref_Locus_2 (COX)

GRC Assembly Management

17q deletion

H1

H2

Zody et al, 2008

GRC Assembly Management

[Figure: reads aligned to a chromosome and to an alt/patch scaffold, showing on-target alignment and off-target alignments (n=122,922)]

GRC Assembly Management

Masks and alt-aware aligners reduce the incidence of ambiguous alignments observed when aligning reads to the full assembly.

• Mask1: mask the chromosome for fix patches, and the scaffold for novel patches/alts.
• Mask2: mask only on scaffolds.
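The two masking schemes can be sketched as a rule for choosing which sequence to mask, plus a hard-masking step that overwrites the chosen region with N. This is only an illustration of the idea on the slide: the function names, the scheme labels "mask1"/"mask2", and the interval coordinates are assumptions, not the GRC's actual pipeline.

```python
def mask_target(scaffold_type, scheme):
    """Return which sequence gets masked for a fix, novel, or alt scaffold.

    Mask1: chromosome for fix patches, scaffold for novel patches/alts.
    Mask2: scaffold only.
    """
    if scheme == "mask1" and scaffold_type == "fix":
        return "chromosome"
    return "scaffold"

def hard_mask(sequence, intervals):
    """Replace bases in 0-based, end-exclusive intervals with N."""
    masked = list(sequence)
    for start, end in intervals:
        start = max(0, start)
        end = min(len(masked), end)
        if start < end:
            masked[start:end] = "N" * (end - start)
    return "".join(masked)

# Hypothetical region where a patch scaffold aligns to its chromosome.
print(mask_target("fix", "mask1"))
print(hard_mask("ACGTACGTAC", [(2, 5)]))
```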

GRC Assembly Management

[Figure: assembly model diagram. Labels: Old Assembly Model; Updated Assembly Model; Distributed data; Centralized Data; Genome in INSDC Database; Genome not in INSDC Database]

Outline
• Reference Assembly Basics
• GRC: Assembly management and dataflow
• GRCh38
• Accessing the assembly and data

http://genomereference.org

GRCh38

GRCh38 Impact

GRCh37 Scaff N50: 44,983,201
GRCh37B Scaff N50: 62,124,159

GRCh37 Contig N50: 38,440,852
GRCh37B Contig N50: 49,319,739

Major Features of GRCh38
• Modeled centromeres
• Individual base updates
• Fixed tiling path/assembly errors
• Addition of novel sequence

CENTROMERES

GRCh38 Impact

• Mismatches, MAF = 0: n=15,244 (61-mer analysis set and 1kG high-confidence set; diagram counts 9664, 1358 and 4222)
• Insertions, MAF=0: n=834
• Deletions, MAF=0: n=1541
• Mismatch in pseudo/pr txpt, MAF<5%: n=1413
• Annotator and clinical requests: n= ~260

GRCh38 Impact

Pile-Up Analysis: “Never Seen” Mismatched Bases Originating from RP11 Components

79% of these bases are heterozygous in RP11 WGS

GRCh37 Insertions Originating from RP11
GRCh37 Deletions Originating from RP11

17% heterozygous in RP11 WGS
18% heterozygous in RP11 WGS

GRCh38 Impact

1p21 patch alignment to chromosome 1 (regions labeled: 1q32, 1q21, 1p21)

Dennis et al., 2012

GRCh38 Impact

HYDIN: chr16 (16q22.2)
HYDIN2: chr1 (1q21.1)
Missing in NCBI35/NCBI36
Unlocalized in GRCh37
Finished in GRCh38

Alignment of HYDIN2 Genomic, 300 Kb, 99.4% ID
Alignment of HYDIN CHM1_1.0, >99.9% ID

Doggett et al., 2006

GRCh38 Impact

Other Major Tiling Path Updates
• Single CHM1 haplotype paths for:
  • 1p12, 1q21, 1q32: SRGAP2
  • IGH
  • LRC/KIR
  • CCL3L1 (17q21)
• OM-guided
  • 10q11
  • Chr. 9 peri-centromeric inversion

GRCh38 Impact

NOVEL GENES!
GRCh37.p13: 211 genes found only on alt loci and patches

GRCh38 Impact

Sudmant et al., 2010

Genovese et al., 2013

1000G decoy sequence, viewed by:
• GenBank alignment
• Percent Repeat Masked
• Repeat Mask type
• Sequence Source (HTG, HuRef, ALLPATHS)

GRCh38 Impact

In a preliminary analysis, 90% of NA12878 reads that previously aligned uniquely to the decoy sequence had an alignment to the updated assembly.

GRCh38 Impact

Where is the decoy sequence in GRCh38?
• Alt loci (low repeat content)
• Model centromeres (high repeat content)
• Unlocalized/Unplaced Scaffolds
• Chromosomes

Outline
• Reference Assembly Basics
• GRC: Assembly management and dataflow
• GRCh38
• Accessing the assembly and data

http://genomereference.org

Accessing the Data

NCBI Genes, Ensembl Genes, Annotated Clone Problems, Segmental Duplications

GRCh38 in Ensembl

GRCh38 will be incorporated into the existing Ensembl interface. Features such as genes, variation, and regulation will be remade or remapped onto the new genome. Nearly 500 tracks are available.

GENCODE gene set

Accessing the Data

Alternate sequences in Ensembl

Haplotypes and patches on the chromosome

A fix patch around the ABO gene

Use the Region comparison view to see the difference between the patch and primary assembly

The GRC alignment track indicates edits

View your data on the Genome

Zoomed in

Zoomed out

Follow the link from the homepage

Red bases show mismatches

Transition to GRCh38 in Ensembl

INSDC coordinates identify the assembly as well as the position

Convert coordinates between assemblies

Our blog series details our progress with GRCh38: Ensembl.info

Remap Set up slide

Accessing the Data

1000 Genomes Browser: http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes
GeT-RM Browser: http://www.ncbi.nlm.nih.gov/variation/tools/getrm
Variation Viewer: http://www.ncbi.nlm.nih.gov/variation/view (coming Fall 2013!)

Tiling Path

Sequence Bar

Segmental Duplications, Eichler Lab

1000 Genomes strict accessibility mask

Annotated clone assembly problems

dbSNP Build 138 based on annotation run 104

Model based paralogous sequence differences, NCBI annotation run #

Paralogous/pseudo gene alignments, NCBI annotation run #

Single Unique Nucleotide (SUN) map, Sudmant 2010

ClinVar Long Variations

GRC Curation Issues

ClinVar Short Variations

http://twitter.com/GenomeRef
grc-announce@ncbi.nlm.nih.gov

Accessing the Data

http://genomeref.blogspot.com/
