The 1000 Genomes Project Gil McVean Department of Statistics, Oxford.


Page 1

The 1000 Genomes Project

Gil McVean, Department of Statistics, Oxford

Page 2

What is the 1000 Genomes Project?

• A catalogue of all types of genetic variation, including rare variants (c. 1% frequency) obtained by sequencing at least 1000 individuals from geographic centres of major medical genetics interest

• A large international collaboration: UK, USA, China, Germany

• An exploration of the use of next-generation technologies for population-scale genome sequencing

• A resource for accelerating the rate of identifying disease mechanisms in the follow-up to disease-association studies

Page 3

Samples for the main project

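[Figure: map of sampling locations. Population labels: UK, FIN, TSI, ESP, CEU, JPT, CHB, CHS, DAI, KVT, GMB, GHN, YRI, MLW, LWK, MXL, ASW, CMB, PRO. The label "new" also appears on the map.]

Major population groups composed of subpopulations of c. 100 each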

Page 4

Population-scale genome sequencing

[Figure labels: Haplotypes, 2x, 10x]

Page 5

Pilot experiments

• Pilot 1: Low-coverage (2x-4x) on 60 unrelated individuals from each of CEU, YRI and CHB+JPT

• Pilot 2: High-coverage (20x diploid) on 2 trios (one from CEU, one from YRI)

• Pilot 3: Exons from 1000 genes to 20x in c. 1000 samples (largely European)

Complete!

Page 6

The 1000G Low Coverage Pilot

• 185 individuals from 4 populations: CEU (63), CHB (30), JPT (30), YRI (62)

Population  Technology  N Individuals  Mapped Bases (billions)  Mean Coverage / Individual
CEU         SLX         52             482                      3.09
CEU         SOLiD       30             240                      2.66
CEU         454         18             132                      2.45
CHB         SLX         30             234                      2.60
JPT         SLX         28             227                      2.70
JPT         454         2              9.6                      1.60
YRI         SLX         60             594                      3.30
YRI         SOLiD       5              20.6                     1.38
YRI         454         2              10.8                     1.80
Combined                185            1,884                    3.52
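
The per-technology coverage figures above appear consistent, to within rounding, with dividing mapped bases by the number of individuals and by a genome length of roughly 3 Gb. That genome length is an assumption made here, not a value stated on the slide, and the function name is only illustrative. A minimal Python sketch of the arithmetic:

```python
# Assumed genome length in billions of bases (not stated in the slides).
GENOME_GB = 3.0

def mean_coverage(mapped_bases_billions, n_individuals, genome_gb=GENOME_GB):
    """Mean per-individual depth: mapped bases / (individuals * genome length)."""
    return mapped_bases_billions / (n_individuals * genome_gb)

# (population, technology, N individuals, mapped bases in billions), from the table
rows = [
    ("CEU", "SLX", 52, 482.0),
    ("CEU", "SOLiD", 30, 240.0),
    ("CHB", "SLX", 30, 234.0),
    ("YRI", "SLX", 60, 594.0),
]

for pop, tech, n, bases in rows:
    print(f"{pop} {tech}: {mean_coverage(bases, n):.2f}x")
```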

Page 7

Even so, a lot of data isn’t much

• In the Pilot 1 sample, 1 tera-basepair of data leaves the CEU with…
  – 6% of genotypes with 0 reads
  – 16% of genotypes with < 2 reads
  – 29% of genotypes with < 3 reads
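
To give a feel for why low coverage leaves many genotypes with few or no reads, the sketch below computes the fraction of sites with fewer than k reads under an idealised Poisson model of read depth. Both the Poisson assumption and the mean depth of 3.0 are illustrative choices, not values taken from the slides; real depth varies between individuals and regions, so the output need not reproduce the percentages above.

```python
import math

def fraction_below(k, mean_depth):
    """P(depth < k) when per-site read depth is Poisson with the given mean."""
    return sum(
        math.exp(-mean_depth) * mean_depth**i / math.factorial(i)
        for i in range(k)
    )

mean_depth = 3.0  # assumed illustrative mean depth per individual
for k in (1, 2, 3):
    print(f"P(depth < {k}) = {fraction_below(k, mean_depth):.3f}")
```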

Page 8

ftp.1000genomes.ebi.ac.uk

www.1000genomes.org

Pilot release expected Nov/Dec 2009

ftp-trace.ncbi.nih.gov/1000genomes/ftp

Page 9

Page 10

What has the project already generated?

Page 11

Over 9 million novel SNPs

• Total 17.2M SNPs called

• Previously ~12M SNPs “known” (dbSNP 129)
  – 7.9M confirmed
  – 9.2M novel

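[Figure: two panels, Total SNPs and Novel SNPs, each apparently a three-way Venn diagram over CEU, YRI and CHB+JPT. Values in the Total SNPs panel: 4.84, 1.09, 0.78, 0.48, 2.80, 5.65, 1.54. Values in the Novel SNPs panel: 0.50, 0.38, 0.29, 0.26, 2.20, 4.38, 1.35.]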

Le Quang

Page 12

A near-complete record of common SNPs

Durbin, Le Quang

Page 13

A set of accurate genotypes

[Figure: chart with categories HomRef, Het, HomNonRef and Average for CEU, JPT, CHB and YRI; vertical axis from 0.88 to 1.00]

• This is about where simulations suggest we should be with 2-4x on 60 samples

• Note this quality is much better than if calls were made marginally

Durbin, Le Quang
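
One reading of "marginally" is calling each genotype from that individual's own reads at a single site, without borrowing information from other samples or neighbouring sites. The sketch below illustrates how little a couple of reads can say on their own. It assumes a simple binomial read model, a flat prior over genotypes and a 1% per-read error rate; all of these, and the function names, are illustrative assumptions rather than the project's calling method.

```python
from math import comb

def genotype_likelihoods(n_ref, n_alt, error=0.01):
    """Binomial likelihood of the observed read counts under each diploid genotype."""
    n = n_ref + n_alt
    # Probability that a single read shows the non-reference allele, per genotype.
    p_alt = {"HomRef": error, "Het": 0.5, "HomNonRef": 1.0 - error}
    return {
        g: comb(n, n_alt) * p**n_alt * (1.0 - p)**n_ref
        for g, p in p_alt.items()
    }

def posterior(n_ref, n_alt, error=0.01):
    """Normalise the likelihoods under a flat prior over the three genotypes."""
    lik = genotype_likelihoods(n_ref, n_alt, error)
    total = sum(lik.values())
    return {g: v / total for g, v in lik.items()}

# Two reads, both showing the reference allele.
for genotype, prob in posterior(2, 0).items():
    print(f"{genotype}: {prob:.3f}")
```

Under these assumptions, two reference reads still leave a heterozygote with appreciable posterior probability, since a heterozygous site yields two reference reads a quarter of the time.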

Page 14

Many novel indels and larger structural variants

Page 15

[Figure: bar chart with categories Ref-free insertions, Ref-assisted insertions, Ref-free deletions and Ref-assisted deletions; axis from 0 to 3000; legend: Calls, >50bp, DGV, 1000g release]

Zam Iqbal

Up to 50kb

Novel sequence from de novo assembly

Page 16

Some interesting biology – variation in SNP density

Page 17

Some more interesting biology – high Fst SNPs

Ryan Hernandez, Adam Auton

Page 18

Even more interesting biology – loss of function mutations

Daniel MacArthur

Page 19

A robust and modular pipeline for analysis of population-scale sequence data

Page 20

An efficient format for storing aligned reads and a set of tools to manipulate and view the files

• SAM/BAM format for storing (aligned) reads

Bioinformatics (2009) http://samtools.sourceforge.net
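
As a rough illustration of what a SAM record holds, the sketch below splits one tab-separated alignment line into its mandatory columns, following the column order of the current SAM specification. The record is made up for illustration, the class and function names are hypothetical, and optional tags are ignored; real work would normally go through samtools or a library built on it.

```python
from typing import NamedTuple

class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description
    seq: str     # read bases
    qual: str    # base qualities

def parse_sam_line(line: str) -> SamRecord:
    """Parse the mandatory columns of one SAM alignment line (optional tags ignored)."""
    f = line.rstrip("\n").split("\t")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5], f[9], f[10])

# Hypothetical record, made up for illustration.
line = "read1\t16\tchr1\t1000\t37\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII"
rec = parse_sam_line(line)
is_reverse = bool(rec.flag & 0x10)  # flag bit 0x10: aligned to the reverse strand
is_unmapped = bool(rec.flag & 0x4)  # flag bit 0x4: read is unmapped
print(rec.rname, rec.pos, rec.cigar, is_reverse, is_unmapped)
```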

Page 21

An information-rich format for storing generic haplotype/genotype data and tools for manipulating the files

www.1000genomes.org/wiki/doku.php?id=1000_genomes:analysis:vcfv3.2
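
The sketch below shows the general shape of a VCF data line: fixed site columns followed by one colon-delimited column per sample, keyed by the FORMAT column. It follows the column layout of later VCF versions (4.x); details of the v3.2 draft linked above may differ. The record, the sample names and the function name are made up for illustration.

```python
def parse_vcf_line(line, sample_names):
    """Split one VCF data line into site fields and per-sample fields."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info, fmt = fields[:9]
    keys = fmt.split(":")
    samples = {
        name: dict(zip(keys, column.split(":")))
        for name, column in zip(sample_names, fields[9:])
    }
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": vid,
        "ref": ref,
        "alt": alt.split(","),
        "filter": flt,
        "info": info,
        "samples": samples,
    }

# Hypothetical record with two hypothetical samples.
line = "1\t12345\t.\tA\tG\t50\tPASS\tDP=12\tGT:DP\t0|1:5\t1|1:7"
rec = parse_vcf_line(line, ["sampleA", "sampleB"])
print(rec["chrom"], rec["pos"], rec["ref"], rec["alt"], rec["samples"]["sampleA"]["GT"])
```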

Page 22

Using the 1000G data now

Page 23

IMPUTE

Reference panel (1000G)
… 11101010101011 …
… 00111110000111 …
… 11110000011101 …
… 00101011100101 …

Genotypes in additional samples from standard product
… 1.2..1.0.0..22 …

Imputation

Imputed genotypes
… 11220110200122 …
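
The idea of filling in untyped sites from a reference panel can be sketched with a deliberately naive stand-in: pick the pair of reference haplotypes whose summed alleles best match the typed genotypes, then copy that pair's alleles into the gaps. This is not the IMPUTE algorithm, which uses a more sophisticated probabilistic model, and the function name is hypothetical. The haplotypes and the typed genotype string are the ones shown on the slide; the result of this naive method need not equal the slide's illustrative imputed string.

```python
from itertools import combinations_with_replacement

def impute_by_best_pair(reference_haplotypes, observed):
    """Fill untyped sites ('.') using the reference pair that best fits the typed sites."""
    typed = [(i, int(g)) for i, g in enumerate(observed) if g != "."]
    best_pair, best_mismatches = None, None
    for h1, h2 in combinations_with_replacement(reference_haplotypes, 2):
        # Count typed sites where the pair's allele sum disagrees with the genotype.
        mismatches = sum(int(h1[i]) + int(h2[i]) != g for i, g in typed)
        if best_mismatches is None or mismatches < best_mismatches:
            best_pair, best_mismatches = (h1, h2), mismatches
    h1, h2 = best_pair
    # Keep typed genotypes as observed; fill gaps with the pair's allele sum.
    return "".join(
        g if g != "." else str(int(a) + int(b))
        for g, a, b in zip(observed, h1, h2)
    )

reference = [
    "11101010101011",
    "00111110000111",
    "11110000011101",
    "00101011100101",
]
observed = "1.2..1.0.0..22"
imputed = impute_by_best_pair(reference, observed)
print(imputed)
```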

Page 24

Imputation performance across SNP types from P1 (CEU) from Affy 500k

Annotation               # SNPs          Info measure
All                      414,321         0.780
MAF < 5%                 102,000         0.543
MAF > 5%                 312,321         0.857
UCSC Genes               6,628           0.736
Depth < 100              3,153 (0.7%)    0.611
SimpRpts                 25,625          0.607
SimpRpts + Depth < 100   1,652 (6.5%)    0.671
SegDups                  24,301          0.686
SegDups + Depth < 100    665 (2.7%)      0.388

Jonathan Marchini
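
As a quick consistency check on the table, the two MAF rows should account for every SNP in the "All" row, and if the info measure is a per-SNP average (an assumption, since the slide does not say how it is summarised), the "All" value should be close to the SNP-weighted mean of the two MAF rows. A minimal Python sketch:

```python
# Counts and info measures copied from the table.
n_rare, info_rare = 102_000, 0.543      # MAF < 5%
n_common, info_common = 312_321, 0.857  # MAF > 5%

n_all = n_rare + n_common
# SNP-weighted mean, assuming the info measure is averaged per SNP.
weighted_info = (n_rare * info_rare + n_common * info_common) / n_all

print(n_all, round(weighted_info, 3))
```

The gap between the two MAF classes is the point of the slide's comparison: rare SNPs are imputed noticeably less well than common ones.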

Page 25

Looking forward...

• Already have data generated for c. 200 more Europeans
  – Data generation largely complete by mid 2010

• Much work still to be done on accurate inference of all types of variation from NGS data

• Data already proven useful for a number of projects – please use it

Page 26

Thanks to the many...