Introduction to Single Nucleotide Polymorphisms (SNPs) Zhongming Zhao

Introduction to Single Nucleotide Polymorphisms (SNPs)

Zhongming Zhao

Department of Psychiatry and Center for the Study of Biological Complexity

June 28, 2004

Email: zzhao@vcu.edu

Organization

Introduction to single nucleotide polymorphism (SNPs)

An overview of mammalian genome projects

Online resources for SNPs and genome sequences

SNPs

SNPs are DNA sequence variations that occur when a single nucleotide (A, T, C, or G) is altered (a single base variation).

Single Nucleotide Polymorphism

[Figure: paired DNA sequences differing at a single base position, labeled as a G/A polymorphism]

Sequence Alignment

Alignment of 16 SARS genome sequences by program Clustal W
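Once sequences are aligned, candidate SNPs are simply the columns where the sequences disagree. The following is a minimal sketch of that idea, assuming the alignment has already been produced (for example by Clustal W) and loaded as equal-length strings. The function name `variable_sites` and the toy sequences are illustrative assumptions, not part of the original slides or of any SARS data set.

```python
from collections import Counter


def variable_sites(aligned_seqs):
    """Return (position, allele_counts) for columns with more than one base.

    Positions are 1-based. Columns containing anything other than A, C, G, T
    (gaps, N, ...) are skipped, so only single-base substitutions are reported.
    """
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")

    sites = []
    for pos, column in enumerate(zip(*aligned_seqs), start=1):
        bases = [b.upper() for b in column]
        if any(b not in "ACGT" for b in bases):
            continue  # gap or ambiguous base: not a clean substitution
        counts = Counter(bases)
        if len(counts) > 1:
            sites.append((pos, dict(counts)))
    return sites


# Toy alignment (made-up sequences for illustration only)
toy = ["ACGTGA", "ACATGA", "ACGTGA", "ACGT-A"]
print(variable_sites(toy))
```

Column 3 of the toy alignment is reported as a G/A site, while column 5 is skipped because one sequence has a gap there.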

SNPs in Substitution Types

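[Table: 4 x 4 substitution matrix with "From" and "To" axes over A, C, G, T]

The six substitution types and their IUPAC ambiguity codes:

Transitions (purine/purine or pyrimidine/pyrimidine):

R: A/G

Y: C/T

Transversions (purine/pyrimidine):

M: A/C

K: G/T

W: A/T

S: C/G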
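The six codes above can be looked up mechanically, and each pair can be classified as a transition (A/G or C/T) or a transversion (the other four). This is a small sketch; the names `iupac_code` and `substitution_class` are illustrative choices, not taken from the slides.

```python
# IUPAC ambiguity codes for the six unordered two-base combinations
IUPAC_TWO_BASE = {
    frozenset("AG"): "R",
    frozenset("CT"): "Y",
    frozenset("AC"): "M",
    frozenset("GT"): "K",
    frozenset("AT"): "W",
    frozenset("CG"): "S",
}
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def _pair(a, b):
    """Validate two alleles and return them as an unordered set."""
    pair = {a.upper(), b.upper()}
    if len(pair) != 2 or not pair <= set("ACGT"):
        raise ValueError("expected two different bases from A, C, G, T")
    return pair


def iupac_code(a, b):
    """IUPAC code for a bi-allelic SNP, e.g. G/A -> R."""
    return IUPAC_TWO_BASE[frozenset(_pair(a, b))]


def substitution_class(a, b):
    """'transition' if both alleles are purines or both are pyrimidines."""
    pair = _pair(a, b)
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    return "transversion"


print(iupac_code("G", "A"), substitution_class("G", "A"))
print(iupac_code("A", "T"), substitution_class("A", "T"))
```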

Distribution of Substitutions

Data           A/G (%)  C/T (%)  A/C (%)  G/T (%)  A/T (%)  C/G (%)  Ts (%)  Ts/Tv
Mouse dbSNP    34.11    33.94    8.63     8.60     8.39     6.32     68.05   2.13
Mouse Celera   33.35    33.33    9.13     9.08     8.83     6.29     66.67   2.00
Human          33.12    33.15    8.74     8.77     7.42     8.80     66.28   1.97

[Figure: bar chart of the proportion (%) of each substitution type (A/G, C/T, A/C, G/T, A/T, C/G) for Mouse dbSNP, Mouse Celera, and Human; vertical axis from 0 to 40]
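The Ts (%) and Ts/Tv columns follow directly from the six substitution percentages: transitions are A/G plus C/T, and transversions are the remaining four types. The sketch below recomputes them from the table values; the dictionary layout and the function name `ts_tv` are illustrative assumptions.

```python
# Substitution percentages copied from the table above
TABLE = {
    "Mouse dbSNP":  {"A/G": 34.11, "C/T": 33.94, "A/C": 8.63, "G/T": 8.60, "A/T": 8.39, "C/G": 6.32},
    "Mouse Celera": {"A/G": 33.35, "C/T": 33.33, "A/C": 9.13, "G/T": 9.08, "A/T": 8.83, "C/G": 6.29},
    "Human":        {"A/G": 33.12, "C/T": 33.15, "A/C": 8.74, "G/T": 8.77, "A/T": 7.42, "C/G": 8.80},
}


def ts_tv(row):
    """Return (transition %, transition/transversion ratio) for one data set."""
    ts = row["A/G"] + row["C/T"]
    tv = row["A/C"] + row["G/T"] + row["A/T"] + row["C/G"]
    return ts, ts / tv


for name, row in TABLE.items():
    ts, ratio = ts_tv(row)
    print(f"{name}: Ts = {ts:.2f}%, Ts/Tv = {ratio:.2f}")
```

Because the table entries are already rounded to two decimals, the recomputed values can differ from the printed Ts (%) and Ts/Tv columns by about 0.01.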

SNPs are Valuable Tools in Genetic Analysis

Disease Studies
− Causes of genetic diseases
− Association studies of complex diseases

Population Studies
− Population structures and history
− Haplotype analysis

Functional Analysis
− Pharmacogenomics

Genome Mapping
− Dense/fine marker set
− Haplotype map

Comparative Genomics
− Genome evolution
− Mechanism of molecular evolution

SNP Databases

Public
− NCBI dbSNP
− TSC
− Whitehead Institute SNP Database
− HGMD
− HGBase (now HGVD)
− UCSC Genome Browser
− Ensembl
− Mouse Phenome Database

Private
− Celera RefSNP
− Sequenom RealSNP
− Incyte SNP Program

SNP Databases

Celera RefSNP
− Celera CgsSNP: identified by the computational method from five individuals’ genomic sequences
− Most SNPs are mapped
− dbSNP, HGMD, HGBase
− 5.0 million human SNPs
− 3.1 million mouse SNPs

NCBI dbSNP
− Launched in Sept. 1998
− Data are deposited by various sources
− rs: grouping of identical, independent submissions of variation
− Recomputed in builds based on incremental freezes
− 24 species
− Over 19 million submissions

NCBI dbSNP

dbSNP & genome build cycle

[Figure: flow diagram of the build cycle. Node labels: Locus Link; data dump; MSSQL; FASTA; submission; RefSNP docsum set; asn.1 + XML; link calculation & annotation; MapView; RefSeq; genome sequence; rs set; new ss accessions set; recalculation & mapping; denormalization]

• Rs ID anchors links back to dbSNP

• Checkpoint for data synchronization

• Synchronized with NCBI genome assembly pipelines

dbSNP growth: human data 1998-2003

[Figure: growth curve of human dbSNP data, annotated with the milestones below]

• 2.1M SNPs in first comprehensive map: Nature 2001

• First TSC submission towards their goal of 200K SNPs

• Computational mining from genome clone seq. ramps up

• HapMap begins additional 6x shotgun coverage

• June 2004: 9.8M refSNPs. 2005: Perlegen+NHGRI+?? 12-15M

Human Variations in dbSNP Build 121

Total submissions (all ss#): 19,888,389

Total non-redundant submissions: 9,856,125

‘SNP’ class: 9,170,759

Uniquely mapped (ref only): 8,549,864

Unique + SNP: 7,946,976

Mapping SNPs to the Genome

• Format the flanking sequences of SNPs (e.g. 50 bp on each side)

• Align them with BLAST or BLAT and accept a hit under the following criteria:

− No gaps in the aligned region

− The SNP position is within the aligned region

− The aligned region is at least 100 bp in length

− Only 1 ambiguous letter matches

− No more than 1% sequence mismatches in the aligned region
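The acceptance criteria above translate into a simple filter over alignment hits. The sketch below assumes each BLAST or BLAT hit has already been parsed into a small record; the `Hit` class, its field names, and the reading of "only 1 ambiguous letter" as "at most one" are assumptions made for illustration, not part of either program's output format.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One alignment of a SNP flanking sequence to the genome (hypothetical record)."""
    query_start: int  # 1-based, inclusive, on the flanking sequence
    query_end: int    # 1-based, inclusive
    gaps: int         # gaps in the aligned region
    mismatches: int   # mismatches, not counting the ambiguous SNP letter
    ambiguous: int    # ambiguity letters (R, Y, ...) in the aligned region


def passes_criteria(hit, snp_pos, min_len=100, max_mismatch_frac=0.01):
    """Apply the five mapping criteria to one hit."""
    aligned_len = hit.query_end - hit.query_start + 1
    return (
        hit.gaps == 0                                           # no gaps
        and hit.query_start <= snp_pos <= hit.query_end         # SNP inside the alignment
        and aligned_len >= min_len                              # at least 100 bp
        and hit.ambiguous <= 1                                  # assumed: at most one ambiguous letter
        and hit.mismatches <= max_mismatch_frac * aligned_len   # no more than 1% mismatches
    )


def maps_uniquely(hits, snp_pos):
    """A SNP maps uniquely when exactly one hit passes all criteria."""
    return sum(passes_criteria(h, snp_pos) for h in hits) == 1


# A 101-bp query (50 bp + SNP + 50 bp) with the SNP at position 51
good = Hit(query_start=1, query_end=101, gaps=0, mismatches=1, ambiguous=1)
short = Hit(query_start=1, query_end=90, gaps=0, mismatches=0, ambiguous=1)
print(passes_criteria(good, 51), passes_criteria(short, 51))
```

With a 101-bp query, the 1% threshold allows a single mismatch besides the SNP letter itself.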

Most SNPs Map Uniquely during Genome Annotation

[Figure: bar chart, log scale from 100 to 10,000,000, of the number of SNPs by hits to genome (Once, Twice, 3 - 10, 11+, Masked) for Human, Mouse, Rat, and Mosquito. Data labels shown in the figure: 71,503; 1,661; 473,215; 5,088; 38,124; 4,899,650; 87,155; 430,839; 6,524]

FASTA Format and Data Structure for a rs Record

Define line (FASTA records start with ">"):

>gnl|dbSNP|rs271_allelePos=51totallen=101|taxid=9606|snpClass=1|alleles='G/A'

Fields, left to right: object type (gnl = general), database name (dbSNP), rs#, offset of the variation (allelePos), total length (totallen), taxID, SNP class, list of alleles

5' sequence: CTGCATCACA TGTACTGATT CTGTCCATTG GAACAGAGAT GATGACTGGT

variation: R

3' sequence: TTACTAAACC CTGAGCCCTG GTGTTTCTGT TGATAGGGGG TTGCATTGAT

http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?rs=rs271
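The define line can be pulled apart with a few regular expressions, and the flanking sequences can be checked against the declared offset and length. This sketch follows the field layout exactly as it appears on the slide, which may not match current dbSNP FASTA files; the function name `parse_defline` and the result keys are illustrative assumptions.

```python
import re

# Record text as shown on the slide
DEFLINE = ">gnl|dbSNP|rs271_allelePos=51totallen=101|taxid=9606|snpClass=1|alleles='G/A'"
FLANK_5 = "CTGCATCACA TGTACTGATT CTGTCCATTG GAACAGAGAT GATGACTGGT"
FLANK_3 = "TTACTAAACC CTGAGCCCTG GTGTTTCTGT TGATAGGGGG TTGCATTGAT"


def parse_defline(line):
    """Extract the rs number and the key=value fields from a define line."""
    if not line.startswith(">"):
        raise ValueError("FASTA define lines start with '>'")

    def grab(pattern):
        match = re.search(pattern, line)
        if match is None:
            raise ValueError(f"field not found: {pattern}")
        return match.group(1)

    return {
        "rs": grab(r"\|(rs\d+)"),
        "allele_pos": int(grab(r"allelePos=(\d+)")),
        "total_len": int(grab(r"totallen=(\d+)")),
        "taxid": int(grab(r"taxid=(\d+)")),
        "snp_class": int(grab(r"snpClass=(\d+)")),
        "alleles": grab(r"alleles='([^']+)'").split("/"),
    }


record = parse_defline(DEFLINE)
flank5 = FLANK_5.replace(" ", "")
flank3 = FLANK_3.replace(" ", "")

# The variation sits right after the 5' flank, and flanks + 1 give the total length
print(record)
print(len(flank5) + 1 == record["allele_pos"])
print(len(flank5) + 1 + len(flank3) == record["total_len"])
```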

The SNP Consortium (TSC)

• The SNP Consortium (TSC) is a public/private collaboration that has to date discovered and characterized nearly 1.8 million SNPs

• The TSC was funded by 11 corporate members and the Wellcome Trust.

• Started in April 1999; at that time its mission was to develop up to 300,000 SNPs distributed evenly throughout the human genome. By the time it finished in 2001, it had produced 1.5 million SNPs.

• Well designed. Good quality of SNP data and allele frequencies.

Celera CDS

Sequenom’s RealSNP

• Aims to develop assays for Sequenom’s Mass Spec Genotyping machine.

• Most candidate SNPs were obtained from dbSNP; some were from Incyte’s proprietary SNPs

• Started in 2002

• Over 5.4M designed SNP assays

• Over 400,000 working assays

• Over 220,000 confirmed polymorphic SNPs

Distribution of Heterozygosity: 1.42 million SNP Map

• The genome was divided into contiguous bins of 200,000 bp. A histogram was generated of the distribution of heterozygosity values across all such bins.

• Heterozygosity was calculated across contiguous 200,000-bp bins on Chromosome 6. The blue lines mark the range within which 95% of regions fall: 2.0 x 10^-4 to 15.8 x 10^-4. Bins falling outside this range are shown in red. The extended region of unusually high heterozygosity centred at 34 Mb corresponds to the HLA.

• Correlation of nucleotide diversity with GC content of each read (autosomes only): the higher the GC content, the higher the nucleotide diversity.

• Nature 2001 409:928-933
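The binning step described above can be sketched as follows. Note the simplification: this computes SNP density per bin (SNPs per bp) as a crude stand-in and is not the heterozygosity estimator used in the cited paper. The function names and the idea of reusing the slide's 95% range as fixed thresholds are assumptions for illustration.

```python
def snp_density_by_bin(positions, chrom_length, bin_size=200_000):
    """SNPs per bp in each contiguous bin along one chromosome (1-based positions)."""
    n_bins = -(-chrom_length // bin_size)  # ceiling division
    counts = [0] * n_bins
    for pos in positions:
        if not 1 <= pos <= chrom_length:
            raise ValueError(f"position {pos} is outside the chromosome")
        counts[(pos - 1) // bin_size] += 1

    densities = []
    for i, count in enumerate(counts):
        # The last bin may be shorter than bin_size
        width = min(bin_size, chrom_length - i * bin_size)
        densities.append(count / width)
    return densities


def bins_outside_range(densities, low=2.0e-4, high=15.8e-4):
    """Indices of bins below or above the given range (defaults: the slide's 95% range)."""
    return [i for i, d in enumerate(densities) if d < low or d > high]


# Toy data: 100 evenly spaced SNPs in the first bin, none in the second
toy_positions = list(range(1, 200_001, 2_000))
densities = snp_density_by_bin(toy_positions, chrom_length=400_000)
print(densities, bins_outside_range(densities))
```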


The HapMap Project

• To develop a haplotype map of the human genome

• To describe the common patterns of human DNA sequence variation

• U.S.A., Japan, the U.K., Canada, China, and Nigeria

• A total of 270 people

− Yoruba, Nigeria (30 both-parent-and-adult-child trios)

− Japanese (45 unrelated individuals)

− Han Chinese (45 unrelated individuals)

− CEPH (30 trios)

• Genotyped for at least 1 million SNPs spaced evenly across the human genome

The Human Genome & Variation

Science, February 2001; Nature, February 2001

The Rodent Genome & Variation

Nature, December 5, 2002; Nature, April 1, 2004

Human Genome Sequencing Project

International Human Genome Sequencing Consortium (IHGSC)
− A collaboration of 20 groups from the USA, the United Kingdom, Japan, France, Germany, and China
− Goals: DNA sequence, genetic map, physical map, genetic variation, functional analysis, etc.
− A 15-year $3 billion project (1990-2005; draft published 2001)
− Hierarchical shotgun sequencing strategy

Celera Human Genome Project
− Competed with the IHGSC from the biotech industry
− Whole-genome shotgun sequencing (WGS) strategy
− DNA samples from five individuals, mainly from Craig Venter

Many follow-up studies
− Chromosomes 6, 7, 9, 10, 13, 14, 16, 19, 20, 21, 22
− Comparative genomics

Nature 2001 409:860-921

Science 2001 291:1304-1351

Science 2003 300:286-290

The Automatic Production Line at the Whitehead Genome Sequencing Center

The Largest Government Projects Since 1990

Proposed project                 Projected cost ($ billion)   Target completion date   Estimated life-span (years)
Space Station Freedom            30.0                         1999                     30
Earth Observing System           17.0                         2000                     15
Superconducting Super Collider   11.0                         1999                     30
Human Genome Project             3.0                          2005                     Perpetual
Hubble Space Telescope           1.5                          1990                     15-20

Science 2003 300:286-290

Mouse Genome Sequencing Project

Mouse Genome Sequencing Consortium (MGSC)
− Whitehead/MIT Genome Center
− Washington University Genome Sequencing Center
− Wellcome Trust Sanger Institute
− Ensembl

Hybrid Sequencing Strategy (WGS and hierarchical shotgun)

Single mouse strain C57BL/6J (female)

SNPs generated by WGS sequencing: 79,269 SNPs from four strains (C57BL/6J, 129S1/SvImJ, C3H/HeJ, BALB/cByJ)

Nature 2002 420:520

Nature 2002 420:574-578

Rat Genome Sequencing Project

Rat Genome Sequencing Consortium (RGSC)
− Led by Baylor Genome Sequencing Center (BCM-HGSC)
− International collaboration including Celera Genomics

Combined Strategy: WGS and BAC Sequencing

Brown Norway rat (most sequences from two females)

The rat genome (2.75 Gb) is smaller than the human (2.9 Gb) but larger than the mouse (2.5 Gb?)

These three genomes encode similar numbers of genes

Almost all human genes known to be associated with disease have orthologues in the rat genome

About a billion nucleotides (~40% of the euchromatic rat genome) are in the orthologous alignment among human/mouse/rat.

Nature 2004 428:493-521

Hypermutability of CpG

CG      TG
GC  ->  AC

        Mouse (32)   Human (34)
CG      -3.52%       -3.19%
TG      +1.38%       +1.21%
CA      +1.38%       +1.21%
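The TG and CA rows carry identical values because they are the same event read from opposite strands: CG is its own reverse complement, so a C->T change at a CpG on one strand appears as CG -> TG, while the same change on the other strand appears as CG -> CA when read on the first strand. The short sketch below only demonstrates this strand symmetry; the function names are illustrative.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def cpg_transition_products():
    """Dinucleotides seen on the top strand after a C->T change at a CpG site."""
    top_strand_change = "TG"                         # C->T on the top strand: CG -> TG
    bottom_strand_change = reverse_complement("TG")  # C->T on the bottom strand, read on top
    return top_strand_change, bottom_strand_change


print(reverse_complement("CG"))   # CG pairs with CG on the opposite strand
print(cpg_transition_products())
```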

Estimated numbers of CpG islands:

• 30,000 to 45,000 CpG islands in the human genome (Science 2001)

• 45,000 and 37,000 in the human and mouse genomes (PNAS 1993, 90:11995)

• 27,000 and 15,500 in the human and mouse genomes (Nature 2002)

Neighboring Nucleotide Bias of SNPs

[Figure: two panels (Mouse, Human) plotting the percentage of bias (%) for A, C, G, and T against position (bp) from -6 to +6 relative to the SNP, with positions -1 and +1 marked. Data labels shown in the figure: +4.91, -4.63, +5.05, -4.44, +2.58, -3.55]

Map of Conserved Synteny between Human, Mouse, and Rat Genomes

Infer the Mutation Direction

• We have human SNPs with outgroup chimpanzee sequences (divergence time is about 4-6 million years, sequence difference is about 1.2%)

• We have mouse SNPs with outgroup rat sequences (divergence time is about 12-24 million years, sequence diversity is unknown)

Infer the Mutation Direction

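[Figure: two examples of a human A/C SNP compared with chimpanzee (Chimp) and orangutan (Oran) sequences. In the first example the inferred direction is A->C; in the second it is C->A]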
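The usual logic behind this inference is that the allele shared with the outgroup is taken as ancestral and the other allele as derived. The sketch below applies that rule and declines to call a direction when the outgroups disagree or carry neither allele. The function name and the return convention are illustrative assumptions, not the author's implementation.

```python
def infer_direction(snp_alleles, outgroup_bases):
    """Infer mutation direction for a bi-allelic SNP from outgroup bases.

    snp_alleles:    the two alleles, e.g. "AC" or ["A", "C"]
    outgroup_bases: base at the orthologous site in each outgroup species
    Returns e.g. "A->C", or None when the direction cannot be inferred.
    """
    alleles = {a.upper() for a in snp_alleles}
    if len(alleles) != 2:
        raise ValueError("expected exactly two different alleles")

    outgroup = {b.upper() for b in outgroup_bases}
    if len(outgroup) != 1:
        return None  # outgroups disagree (or none given): ambiguous
    ancestral = next(iter(outgroup))
    if ancestral not in alleles:
        return None  # outgroup matches neither allele

    derived = (alleles - {ancestral}).pop()
    return f"{ancestral}->{derived}"


# Human A/C SNP with chimpanzee and orangutan bases at the same site
print(infer_direction("AC", ["A", "A"]))
print(infer_direction("AC", ["C", "C"]))
```

For mouse SNPs the same function would take the rat base as the single outgroup.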

Web Resources

NCBI dbSNP: www.ncbi.nlm.nih.gov/SNP and ftp.ncbi.nlm.nih.gov/snp

Celera Genomics: www.celera.com

The SNP Consortium (TSC): http://snp.cshl.org

UCSC Genome Browser: http://genome.ucsc.edu/

The Human Gene Mutation Database (HGMD): http://archive.uwcm.ac.uk/uwcm/mg/hgmd0.html

Human Genome Variation Database (HGVD): http://hgvbase.cgb.ki.se/

MIT SNP database (Human): http://www.broad.mit.edu/snp/human/

MIT SNP database (Mouse): http://www.broad.mit.edu/snp/mouse/

Sequenom RealSNP: https://www.realsnp.com/default.asp

Ensembl Genome Browser: http://www.ensembl.org/

The HapMap Project: http://www.hapmap.org/

Mouse Phenome Database: http://aretha.jax.org/pub-cgi/phenome/mpdcgi?rtn=projects/details&sym=Mpd1