Church emory2013

Deanna M. Church Staff Scientist, NCBI @deannachurch The intersection of genome assembly and variation management.


Seminar at Emory Sep 2013

Transcript of Church emory2013

Page 1: Church emory2013

Deanna M. Church Staff Scientist, NCBI

@deannachurch

The intersection of genome assembly and variation management.

Page 2: Church emory2013

http://genomereference.org

Valerie Schneider, NCBI

Page 3: Church emory2013

Variation Resources Team at NCBI

Ming Ward, Lon Phan, Brad Holmes, Anna Glodek, Michael Kholodov, Rama Maiti, Juliana Sampson, David Shao, Eugene Shekhtman, Qiang Wang, Hua Zhang

Donna Maglott, Melissa Landrum, Jennifer Lee, George Riley, Ray Tully, Craig Wallin, Shanmuga Chitipiralla, Douglas Hoffman, Wonhee Jang, Ken Katz, Michael Ovetsky, Ricardo Villamarin

Tim Hefferon, John Lopez, John Garner, Chao Chen

Page 4: Church emory2013

Learning Objectives

Why the reference assembly matters for your analysis

How the reference assembly is changing

Tools and Resources to find data

Page 5: Church emory2013

Why should you care about the Reference Assembly?

Page 6: Church emory2013

Genes, NCBI Homo sapiens Annotation Release 105

Transcript

CDS

dbSNP Build 138 using annotation release 104

Page 7: Church emory2013
Page 8: Church emory2013

http://www.bioplanet.com/gcat

Page 9: Church emory2013

What is the Reference Assembly?

Page 10: Church emory2013
Page 11: Church emory2013
Page 12: Church emory2013
Page 13: Church emory2013
Page 14: Church emory2013

An assembly is a MODEL of the genome

Page 15: Church emory2013
Page 16: Church emory2013

BAC insert

BAC vector

Shotgun sequence

Assemble

GAPS

“finishers” go in to manually fill the gaps, often by PCR

Page 17: Church emory2013
Page 18: Church emory2013

http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/issue_detail.cgi?id=HG-1012

Page 19: Church emory2013

http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/issue_detail.cgi?id=HG-1321

Page 20: Church emory2013

RP11-34P13 64E8 RP4-669L17 RP5-857K21 RP11-206L10 RP11-54O7

Gaps

Page 21: Church emory2013

http://genomereference.org

Page 22: Church emory2013

NCBI36 (hg18)

GRCh37 (hg19)

Page 23: Church emory2013

NCBI35 (hg17)

GRCh37 (hg19)

AL139246.20

AL139246.21

Page 24: Church emory2013

Build sequence contigs based on contigs defined in TPF (Tiling Path File).

Check for orientation consistencies

Select switch points

Instantiate sequence for further analysis

Switch point

Consensus sequence
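
The three steps above can be sketched in a few lines. This is only an illustration, not the GRC pipeline: the `Component` record, its field names, and the toy clone sequences are all hypothetical, and a real Tiling Path File carries more information than is modeled here. Each clone is oriented, trimmed at its switch points, and the pieces are joined into one contig sequence.

```python
from dataclasses import dataclass

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


@dataclass
class Component:
    """One clone in a tiling path (hypothetical, simplified record)."""
    accession: str    # illustrative identifier, not a real accession
    sequence: str     # clone sequence as submitted
    orientation: str  # "+" or "-" relative to the contig
    start: int        # 1-based first base used, after orienting the clone
    end: int          # 1-based last base used (switch point to the next clone)


def build_contig(tiling_path):
    """Orient each clone, cut it at its switch points, and join the pieces."""
    pieces = []
    for comp in tiling_path:
        # Step 1: check orientation consistency.
        if comp.orientation not in ("+", "-"):
            raise ValueError(f"{comp.accession}: bad orientation {comp.orientation!r}")
        oriented = (comp.sequence if comp.orientation == "+"
                    else reverse_complement(comp.sequence))
        # Step 2: validate the selected switch points.
        if not (1 <= comp.start <= comp.end <= len(oriented)):
            raise ValueError(f"{comp.accession}: switch points outside clone")
        # Step 3: instantiate the sequence contributed by this clone.
        pieces.append(oriented[comp.start - 1:comp.end])
    return "".join(pieces)


path = [
    Component("CLONE_A.1", "ACGTACGT", "+", 1, 6),
    Component("CLONE_B.1", "AAACGT", "-", 3, 6),
]
contig = build_contig(path)
```

Moving a switch point changes which clone contributes the bases in an overlap, which is one way the same region can differ between assembly versions.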

Page 25: Church emory2013

NCBI36

Page 26: Church emory2013

nsv832911 (nstd68) Submitted on NCBI35 (hg17)

Page 27: Church emory2013

NCBI35 (hg17) Tiling Path

GRCh37 (hg19) Tiling Path

Gap Inserted

Moved approximately 2 Mb distal on chr15

NC_000015.8 (chr15)

NC_000015.9 (chr15)

Removed from assembly

Added to assembly

HG-24

Page 28: Church emory2013

Sequences from haplotype 1

Sequences from haplotype 2

Old Assembly model: compress into a consensus

New Assembly model: represent both haplotypes

Page 29: Church emory2013

AC074378.4 AC079749.5

AC134921.2 AC147055.2

AC140484.1 AC019173.4

AC093720.2 AC021146.7

NCBI36 NC_000004.10 (chr4) Tiling Path

Xue Y et al, 2008

TMPRSS11E TMPRSS11E2

GRCh37 NC_000004.11 (chr4) Tiling Path

AC074378.4 AC079749.5

AC134921.1 AC147055.2

AC093720.2 AC021146.7

TMPRSS11E

GRCh37: NT_167250.1 (UGT2B17 alternate locus)

AC074378.4 AC140484.1

AC019173.4 AC226496.2

AC021146.7

TMPRSS11E2

nsv532126 (nstd37)

Page 30: Church emory2013

GRCh37 (hg19)

http://genomereference.org

7 alternate haplotypes at the MHC

Alternate loci released as:

FASTA

AGP

Alignment to chromosome
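
As a rough illustration of what the AGP release format contains, the sketch below parses one AGP line into a dictionary. It assumes the nine tab-separated columns of the AGP 2.0 layout, in which component types `N` and `U` mark gap lines. The object name and component identifier in the example are made up, and the function name `parse_agp_line` is hypothetical.

```python
def parse_agp_line(line: str) -> dict:
    """Parse a single non-comment AGP line (nine tab-separated columns)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, found {len(fields)}")

    record = {
        "object": fields[0],
        "object_beg": int(fields[1]),
        "object_end": int(fields[2]),
        "part_number": int(fields[3]),
        "component_type": fields[4],
    }
    if fields[4] in ("N", "U"):
        # Gap line: columns 6-9 describe the gap rather than a component.
        record.update(
            gap_length=int(fields[5]),
            gap_type=fields[6],
            linkage=fields[7],
            linkage_evidence=fields[8],
        )
    else:
        # Component line: which part of which sequence fills this span.
        record.update(
            component_id=fields[5],
            component_beg=int(fields[6]),
            component_end=int(fields[7]),
            orientation=fields[8],
        )
    return record


# Made-up example lines; identifiers are illustrative only.
component = parse_agp_line("alt_demo\t1\t100\t1\tF\tCLONE_A.1\t1\t100\t+")
gap = parse_agp_line("alt_demo\t101\t150\t2\tN\t50\tcontig\tno\tna")
```

The AGP file says how an alternate locus is built from components, the FASTA file gives the resulting sequence, and the alignment places that sequence relative to the chromosome.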

UGT2B17 MHC MAPT

Page 31: Church emory2013

MHC (chr6)

Chr 6 representation (PGF)

Alt_Ref_Locus_2 (COX)

Page 32: Church emory2013

Data management and the Reference Assembly?

Page 33: Church emory2013

NC_000086.123456 CM001013.17 2

Mouse chrX: 34,800,000-34,890,000

Page 34: Church emory2013

Mouse chrX: 35,000,000-36,000,000

X

MGSCv3 MGSCv36

Page 35: Church emory2013

ABC14-1065514J1

Gaps Phase Length Date

FP565796.1 1 1 21-Oct-2009

FP565796.2 1 0 14-Oct-2010

FP565796.3 3 0 07-Nov-2010

Page 36: Church emory2013

hg19 GRCh37

mm8 MGSCv37

NCBIM37

danRer5 Zv7
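
One way to keep these names straight in analysis code is to resolve aliases before comparing coordinates and to treat a sequence identifier as an accession plus a version. The sketch below is a minimal illustration under that assumption. The alias table holds only pairs that appear on these slides, and the function names are hypothetical.

```python
# Alias pairs taken from the slides; extend with care for other assemblies.
UCSC_TO_ASSEMBLY = {
    "hg17": "NCBI35",
    "hg18": "NCBI36",
    "hg19": "GRCh37",
    "danRer5": "Zv7",
}


def canonical_assembly(name: str) -> str:
    """Map a UCSC-style name to its assembly name; pass other names through."""
    return UCSC_TO_ASSEMBLY.get(name, name)


def same_coordinate_system(assembly_a: str, assembly_b: str) -> bool:
    """True when both names refer to the same assembly after alias resolution."""
    return canonical_assembly(assembly_a) == canonical_assembly(assembly_b)


def split_accession_version(identifier: str):
    """Split 'AL139246.21' into ('AL139246', 21); reject unversioned names."""
    accession, dot, version = identifier.rpartition(".")
    if not dot or not accession or not version.isdigit():
        raise ValueError(f"{identifier!r} is not an accession.version identifier")
    return accession, int(version)


old = split_accession_version("AL139246.20")
new = split_accession_version("AL139246.21")
same_record = old[0] == new[0]   # same accession
same_sequence = old == new       # differs: the version changed
```

A name such as `chr21` carries no version, so a coordinate like chr21:8,913,216-9,246,964 is unambiguous only when the assembly is stated alongside it.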

Page 37: Church emory2013

chr21:8,913,216-9,246,964

Page 38: Church emory2013

Zv7 chr21:8,913,216-9,246,964 X Mouse Build 36 chrX

Page 39: Church emory2013

http://www.ncbi.nlm.nih.gov/genome/assembly

Page 40: Church emory2013
Page 41: Church emory2013

GenBank vs RefSeq

GenBank            RefSeq
Submitter Owned    RefSeq Owned
Redundancy         Non-Redundant
Updated rarely     Curated
INSDC              Not INSDC

BRCA1

GenBank: 83 genomic records, 31 mRNA records, 27 protein records

RefSeq: 3 genomic records, 5 mRNA records, 1 RNA record, 5 protein records

Page 42: Church emory2013

http://www.ncbi.nlm.nih.gov/refseq/rsg

http://www.lrg-sequence.org/

Page 43: Church emory2013

http://www.ncbi.nlm.nih.gov/refseq/rsg

RefSeq Gene

L R

Page 44: Church emory2013
Page 45: Church emory2013

http://www.ncbi.nlm.nih.gov/genome/tools/remap

From Assembly 1 <-> Assembly 2

Assembly <-> RefSeqGene/LRG

Primary Assembly <-> Alternate loci
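
The idea behind coordinate remapping can be sketched with a list of aligned blocks between a source and a target sequence. This is not the NCBI Remap service or its API, only a toy model: the `Block` type, the block coordinates, and `remap_position` are all hypothetical. A position that falls in no block, for example because sequence was removed or a gap was inserted, has no target coordinate.

```python
from typing import NamedTuple, Optional


class Block(NamedTuple):
    """An ungapped aligned block between two sequences (1-based, inclusive)."""
    src_start: int
    src_end: int
    dst_start: int  # smallest target coordinate covered by the block
    strand: str     # "+" same orientation, "-" reverse orientation


def remap_position(blocks, pos: int) -> Optional[int]:
    """Return the target coordinate for pos, or None if it is unaligned."""
    for block in blocks:
        if block.src_start <= pos <= block.src_end:
            if block.strand == "+":
                return block.dst_start + (pos - block.src_start)
            # Reverse orientation: the last source base maps to dst_start.
            return block.dst_start + (block.src_end - pos)
    return None


# Hypothetical alignment: an unchanged block, a block shifted by 50 bases,
# an unaligned stretch (201-300), and a block that is inverted in the target.
blocks = [
    Block(1, 100, 1, "+"),
    Block(101, 200, 151, "+"),
    Block(301, 400, 251, "-"),
]

unchanged = remap_position(blocks, 50)
shifted = remap_position(blocks, 150)
unaligned = remap_position(blocks, 250)
inverted = remap_position(blocks, 301)
```

The same block model applies whether the two sequences are two assembly versions, an assembly and a RefSeqGene/LRG record, or the primary assembly and an alternate locus.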

Page 46: Church emory2013

Variant Calling and the Reference Assembly

Page 47: Church emory2013

Kidd et al, 2007 APOBEC cluster

Part of chr22 assembly

Alternate locus for chr22

White: Insertion

Black: Deletion

Page 48: Church emory2013

http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes

Page 49: Church emory2013

Hydin: chr16 (16q22.2)

Hydin2: chr1 (1q21.1)

Missing in NCBI35/NCBI36

Unlocalized in GRCh37

Finished in GRCh38

Alignment to Hydin2 Genomic, 300 Kb, 99.4% ID (Paralogous)

Alignment to Hydin1 CHM1_1.0, >99.9% ID (Allelic)

Doggett et al., 2006

Page 50: Church emory2013

http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes

CDC27

1KG Phase 1 Strict accessibility mask

SNP (all)

SNP (not 1KG)
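
A simple way to use an accessibility mask in an analysis is to keep only variant calls whose positions fall inside the mask. The sketch below assumes the mask for one chromosome has already been loaded as non-overlapping, BED-style intervals (0-based start, exclusive end). The intervals and positions shown are invented, and the function names are hypothetical.

```python
from bisect import bisect_right


def build_mask_index(intervals):
    """Sort non-overlapping (start, end) intervals and split them for bisect."""
    ordered = sorted(intervals)
    starts = [start for start, _ in ordered]
    ends = [end for _, end in ordered]
    return starts, ends


def is_accessible(index, pos: int) -> bool:
    """True if 0-based position pos lies inside any mask interval."""
    starts, ends = index
    # Find the last interval that starts at or before pos.
    i = bisect_right(starts, pos) - 1
    return i >= 0 and pos < ends[i]


def filter_positions(index, positions):
    """Keep only the positions that fall in accessible sequence."""
    return [pos for pos in positions if is_accessible(index, pos)]


# Invented mask intervals and variant positions on a single chromosome.
mask = build_mask_index([(300, 350), (100, 200)])
kept = filter_positions(mask, [99, 100, 199, 200, 320, 400])
```

Calls that fall outside the mask are not necessarily wrong, but in regions such as paralogous duplications they deserve a closer look before being interpreted.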

Page 51: Church emory2013

Sudmant et al., 2010

Page 52: Church emory2013

Issues with the Reference Assembly

Page 53: Church emory2013

http://genomereference.org

Page 54: Church emory2013

Dennis et al., 2012

1q32 1q21 1p21

1p21 patch alignment to chromosome 1

Page 55: Church emory2013

Fixing Rare/Incorrect Bases

Page 56: Church emory2013

Adding Novel Sequence

Karen Miga and Jim Kent arXiv:1307.0035

Page 57: Church emory2013

Preview of GRCh38 (scheduled Fall 2013)

TEX28 TKTL1

LOC101060233 (opsin related)

LOC101060234 (TEX28 related)

GRCh37 (current reference assembly)

NC_000023.10 (chrX)

NW_003871103.3

Page 58: Church emory2013

FAM23_MRC1 Region, chr10

Segmental Duplications

1KG accessibility Mask

Novel Patch 250 kb of artificial duplication

Page 59: Church emory2013

Adding Novel Sequence

Page 60: Church emory2013

GRCh37.p13

120 Fix Patches

60 Novel

Human Resolved for GRCh38

http://genomereference.org

Page 61: Church emory2013

How to identify problem regions in the Reference Assembly

Page 62: Church emory2013

1000 Genomes Browser: http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes

GeT-RM Browser: http://www.ncbi.nlm.nih.gov/variation/tools/getrm

Variation Viewer: http://www.ncbi.nlm.nih.gov/variation/view (coming Oct 2013!)

Page 63: Church emory2013
Page 64: Church emory2013

Tiling Path

Sequence Bar

Segmental Duplications, Eichler Lab

1000 Genomes strict accessibility mask

Annotated clone assembly problems

Page 65: Church emory2013

dbSNP Build 138 based on annotation run 104

Model based paralogous sequence differences, NCBI annotation run #

Paralogous/pseudo gene alignments, NCBI annotation run #

Singly Unique Nucleotide (SUN) map, Sudmant 2010

ClinVar Long Variations

GRC Curation Issues

ClinVar Short Variations