SNP Resources: Finding SNPs, Databases and Data Extraction

Post on 17-Jan-2016


SNP Resources: Finding SNPs, Databases and Data Extraction. Debbie Nickerson, debnick@u.washington.edu, SeattleSNPs. Complex inheritance/disease. Many Other Genes. Variant Gene. Environment. Disease. Diabetes, Heart Disease, Schizophrenia, Obesity, Multiple Sclerosis, Celiac Disease.

Transcript of SNP Resources: Finding SNPs, Databases and Data Extraction

SNP Resources: Finding SNPs, Databases and Data Extraction

Debbie Nickerson
debnick@u.washington.edu

SeattleSNPs

Complex inheritance/disease

Variant Gene + Many Other Genes + Environment -> Disease

Diabetes, Heart Disease, Schizophrenia, Obesity, Multiple Sclerosis, Celiac Disease, Cancer, Asthma, Autism

Two hypotheses:
1 - common disease/common variant?
2 - common disease/many rare variants?

Genomic Variation (figure axes: Frequency, Size)

Single Nucleotide Polymorphisms
Small indels
Copy-Number Variants
Structural variation: duplications, deletions, insertions, inversions
Cytogenetic

Human Genetic Variation

• Gene-rich, eg immune response, drug metabolism

• Abundant (scale: 1 bp to 1 chr)

Total sequence variation in humans

Population size: 6 x 10^9 (diploid)

Mutation rate: 2 x 10^-8 per bp per generation

Expected “hits”: 240 for each bp

Every variant compatible with life exists in the population

BUT: Most are vanishingly rare

Compare 2 haploid genomes: 1 SNP per 1331 bp*

*The International SNP Map Working Group, Nature 409:928 - 933 (2001)
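The "240 hits per bp" figure follows directly from the numbers on this slide: population size times two genome copies per person times the per-bp mutation rate. A minimal sketch of that arithmetic (the function and constant names are illustrative, not from the talk):

```python
# Figures taken from the slide above.
POPULATION_SIZE = 6e9   # individuals
PLOIDY = 2              # diploid: two genome copies per individual
MUTATION_RATE = 2e-8    # new mutations per bp per generation


def expected_hits_per_bp(population=POPULATION_SIZE, ploidy=PLOIDY, rate=MUTATION_RATE):
    """Expected number of new mutations hitting one base pair,
    summed over every genome copy in the population, per generation."""
    return population * ploidy * rate


# Rounded to avoid floating-point noise in the printed value.
print(round(expected_hits_per_bp()))
```

Because each position is expected to be hit many times per generation, essentially every viable single-base change exists somewhere in the population, which is the point the slide goes on to make.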

Building Maps of Single Nucleotide Polymorphisms

(SNPs)

ATTCGGCATGAA
ATTCGGGATGAA

Developed in two overlapping phases:
1) SNP Discovery
2) SNP Genotyping

Finding SNPs: Sequence-based SNP Mining

RANDOM Sequence Overlap - SNP Discovery

GTTACGCCAATACAG G ATCCAGGAGATTACC
GTTACGCCAATACAG C ATCCAGGAGATTACC

DNA SEQUENCING

Genomic: RRS Library -> Shotgun Overlap
Genomic: BAC Library -> BAC Overlap
mRNA: cDNA Library -> EST Overlap
Random Shotgun -> Align to Reference

> 11 Million SNPs

Validated - 5.6 MILLION SNPs

Increasing Sample Size Improves SNP Discovery

GTTACGCCAATACAG G ATCCAGGAGATTACC
GTTACGCCAATACAG C ATCCAGGAGATTACC
(2 chromosomes)

[Figure: Fraction of SNPs Discovered (0.0 to 1.0) versus Minor Allele Frequency (MAF, 0.0 to 0.5), with curves for 2, 8, 16, 24, 48 and 96 chromosomes. Labels: HapMap (based on ~6-8 random chromosomes); Candidate Gene Sequencing; New 1000 Genome Program.]
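One way to see why more chromosomes find more SNPs is a simple sampling model: a SNP is "discovered" only if both alleles appear at least once among the sequenced chromosomes. The sketch below assumes chromosomes are drawn independently at random; it is an illustrative model, not necessarily the calculation behind the original figure.

```python
def detection_probability(maf, n_chromosomes):
    """Probability that both alleles of a SNP with minor allele
    frequency `maf` are seen in `n_chromosomes` random chromosomes."""
    if not 0.0 <= maf <= 0.5:
        raise ValueError("maf must be between 0 and 0.5")
    if n_chromosomes < 1:
        raise ValueError("need at least one chromosome")
    # 1 - P(all copies carry the minor allele) - P(all copies carry the major allele)
    return 1.0 - maf ** n_chromosomes - (1.0 - maf) ** n_chromosomes


for n in (2, 8, 16, 24, 48, 96):
    row = ", ".join(
        f"MAF {p:.2f}: {detection_probability(p, n):.3f}"
        for p in (0.01, 0.05, 0.10, 0.50)
    )
    print(f"{n:>2} chromosomes -> {row}")
```

Under this model, two chromosomes find only half of the SNPs even at MAF 0.5, while 96 chromosomes find nearly all SNPs with MAF of 5% or more.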

Genotype - Phenotype Studies

What SNPs are available?
How do I find the common SNPs?
What is the validation/quality of the SNPs?
Are these SNPs informative in my population/samples?
Where can I download information?
How do I pick the “best” SNPs? - Dana Crawford

You have candidate gene/region/pathway of interest and samples ready to study:

Minimal SNP information for genotyping/characterization

• What is the SNP? Flanking sequence and alleles. FASTA format:
>snp_name
ACCGAGTAGCCAG[A/G]ACTGGGATAGAAC
• dbSNP reference SNP # (rs #)
• Where is the SNP mapped? Exon, promoter, UTR, etc
• How was it discovered? Method
• What assurances do you have that it is real? Validated how?
• What population – African, European, etc?
• What is the allele frequency of each SNP? Common (>5%), rare
• Are other SNPs associated - redundant?
• Is genotyping data for control populations available?
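The FASTA-style record in the list above (a header line, then flanking sequence with the alleles in square brackets) is easy to split into its parts. A minimal sketch, assuming one bracketed biallelic site per record; the function name and the returned field names are illustrative:

```python
import re

# Left flank, [allele1/allele2], right flank. "-" allows an indel allele.
SNP_PATTERN = re.compile(
    r"^([ACGTN]*)\[([ACGT-]+)/([ACGT-]+)\]([ACGTN]*)$", re.IGNORECASE
)


def parse_snp_fasta(text):
    """Parse records of the form '>name' followed by 'LEFT[A/G]RIGHT'."""
    records = []
    name, chunks = None, []
    # A trailing ">" acts as a sentinel header so the last record is flushed.
    for line in text.splitlines() + [">"]:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                match = SNP_PATTERN.match("".join(chunks))
                if match is None:
                    raise ValueError(f"no [allele/allele] site in record {name!r}")
                left, allele1, allele2, right = match.groups()
                records.append(
                    {"name": name, "left": left, "alleles": (allele1, allele2), "right": right}
                )
            name, chunks = line[1:].strip(), []
        else:
            chunks.append(line)
    return records


example = ">snp_name\nACCGAGTAGCCAG[A/G]ACTGGGATAGAAC\n"
for record in parse_snp_fasta(example):
    print(record)
```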

Finding SNPs: Databases and Extraction

How do I find and download SNP data for analysis/genotyping?

1. SeattleSNPs - Candidate gene website
2. Other web applications
   - GVS
   - HapMap Genome Browser
3. Entrez Gene
   - dbSNP
   - Entrez SNP


Finding SNPs: SeattleSNPs Candidate Genes

pga.gs.washington.edu

Example - PCSK9

AD

ED

SNP_pos <tab> Ind_ID <tab> allele1 <tab> allele2
Repeat for all individuals
Repeat for next SNP
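The four-column, tab-delimited layout above (one line per individual per SNP) can be read into a per-site table and summarized. A minimal sketch, assuming "N" marks a missing allele; the sample IDs and genotypes below are made up for illustration:

```python
from collections import Counter, defaultdict

VALID_ALLELES = {"A", "C", "G", "T"}


def read_prettybase(lines):
    """Return {snp_position: {individual_id: (allele1, allele2)}}."""
    genotypes = defaultdict(dict)
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        position, individual, allele1, allele2 = line.split("\t")
        genotypes[int(position)][individual] = (allele1, allele2)
    return genotypes


def minor_allele_frequency(calls):
    """MAF at one site; alleles outside A/C/G/T are treated as missing."""
    counts = Counter()
    for allele1, allele2 in calls.values():
        counts.update(a for a in (allele1, allele2) if a in VALID_ALLELES)
    total = sum(counts.values())
    if total == 0:
        return None          # no usable genotypes at this site
    if len(counts) < 2:
        return 0.0           # monomorphic in this sample
    return sorted(counts.values())[-2] / total


# Hypothetical data: two SNP positions, four individuals each.
example = [
    "1001\tIND01\tA\tG", "1001\tIND02\tA\tA",
    "1001\tIND03\tG\tG", "1001\tIND04\tA\tA",
    "2050\tIND01\tC\tC", "2050\tIND02\tC\tC",
    "2050\tIND03\tN\tN", "2050\tIND04\tC\tC",
]
for position, calls in sorted(read_prettybase(example).items()):
    print(position, minor_allele_frequency(calls))
```

The MAF value can then be compared against a cutoff such as the 5% "common" threshold mentioned earlier in the talk.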

PolyPhen - Polymorphism Phenotyping
Structural protein characteristics and evolutionary comparison

SIFT = Sorting Intolerant From Tolerant
Evolutionary comparison of non-synonymous SNPs


GVS: Genome Variation Server
http://gvs.gs.washington.edu/GVS/

• Provides rapid analysis of 4.5 million genotyped SNPs from dbSNP and the HapMap
• Mapped to human genome build 36 (hg18)
• Displays genotype data in text and image formats
• Displays tagSNPs or clusters of informative SNPs in text and image formats
• Displays linkage disequilibrium (LD) in text and image formats
• Online tutorial provided at OpenHelix.com

LDLR

• Table of genotypes
• Image of visual genotypes

Genotypes displayed in prettybase table and visual genotype graphic

Dense genotypes around a candidate gene can be integrated with broader, lower-density HapMap genotypes

High Density Genic Coverage (SeattleSNPs): SeattleSNPs discovery (1/200 bp)
Low Density Genome Coverage (HapMap): HapMap SNPs (~1/1000 bp)


GVS: Genome Variation Server

Combined / Common

A. Common samples - combined variations
B. Combined samples - common variations (HapMap, SeattleSNPs)
C. Combined samples - combined variations
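The three merge options A, B and C above can be read as set operations: "common" is an intersection and "combined" is a union, applied separately to samples and to variant sites. The sketch below illustrates that reading only; it is not GVS code, and the data layout, the "NN" missing-genotype code, and the sample names are assumptions.

```python
def samples_in(dataset):
    """All sample IDs that appear at any site of a data set."""
    return set().union(*(calls.keys() for calls in dataset.values()))


def merge_datasets(first, second, samples, variations):
    """Merge two {site: {sample: genotype}} data sets.

    `samples` and `variations` are each "common" (intersection)
    or "combined" (union).
    """
    pick = {"common": set.intersection, "combined": set.union}
    kept_samples = pick[samples](samples_in(first), samples_in(second))
    kept_sites = pick[variations](set(first), set(second))
    merged = {}
    for site in sorted(kept_sites):
        # If both data sets genotype the same sample, the second one wins.
        calls = {**first.get(site, {}), **second.get(site, {})}
        merged[site] = {s: calls.get(s, "NN") for s in sorted(kept_samples)}
    return merged


# Hypothetical data: a dense gene-centric set and a sparser genome-wide set.
dense = {100: {"S1": "AG", "S2": "AA"}, 150: {"S1": "CC", "S2": "CT"}}
sparse = {100: {"S2": "AA", "S3": "GG"}, 300: {"S2": "TT", "S3": "TC"}}

print("A:", merge_datasets(dense, sparse, "common", "combined"))
print("B:", merge_datasets(dense, sparse, "combined", "common"))
print("C:", merge_datasets(dense, sparse, "combined", "combined"))
```

Option C keeps the most data but also produces the most missing genotypes, since samples typed in only one data set have no calls at sites unique to the other.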


Finding SNPs: HapMap Browser

www.hapmap.org

1. HapMap data sets are useful because individual genotype data in deeply sampled populations can be used to determine optimal genotyping strategies (tagSNPs) or perform population genetic analyses (linkage disequilibrium)

2. Data are specific to the HapMap project (not all dbSNP). HapMap data is available in dbSNP

3. Visualization of data and direct access to SNP data, individual genotypes, and LD analysis are possible in the browser, and formats can be saved for Haploview
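Linkage disequilibrium between two SNPs is usually summarized as D, D' and r². A minimal sketch of the standard formulas, assuming phased two-SNP haplotypes are already available (HapMap genotypes are unphased, so tools such as Haploview must estimate haplotype frequencies first); the function name and example haplotypes are illustrative:

```python
def ld_stats(haplotypes, allele_a, allele_b):
    """Return (D, |D'|, r^2) for two biallelic SNPs.

    `haplotypes` is a list of 2-character strings: the allele at SNP 1
    followed by the allele at SNP 2 on the same chromosome.
    """
    n = len(haplotypes)
    p_a = sum(h[0] == allele_a for h in haplotypes) / n
    p_b = sum(h[1] == allele_b for h in haplotypes) / n
    p_ab = sum(h[0] == allele_a and h[1] == allele_b for h in haplotypes) / n

    d = p_ab - p_a * p_b
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denominator == 0:
        raise ValueError("both SNPs must be polymorphic in the sample")
    r_squared = d * d / denominator

    # D' rescales D by its maximum possible magnitude given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return d, abs(d) / d_max, r_squared


# Perfectly correlated SNPs: knowing one allele tells you the other.
print(ld_stats(["AC"] * 4 + ["GT"] * 4, "A", "C"))
# Independent SNPs: all four haplotypes equally frequent.
print(ld_stats(["AC", "AT", "GC", "GT"], "A", "C"))
```

A pair with r² near 1 is redundant for genotyping purposes, which is the basis for choosing tagSNPs.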


NCBI - Database Resource

www.ncbi.nlm.nih.gov

Finding SNPs using NCBI databases
http://www.ncbi.nlm.nih.gov/

PCSK9

Views: Default, View cSNPs

Finding SNPs - Entrez SNP Summary

1. dbSNP is useful for investigating detailed information on a small number of SNPs - and it’s good for a picture of the gene

2. Entrez SNP is a direct, fast database for querying SNP data

3. Data from Entrez SNP can be retrieved in batches for many SNPs

4. Entrez SNP data can be “limited” to specific subsets of SNPs and formatted in plain text for easy parsing and manipulation

5. More detailed queries can be formed using specific “field tags” for retrieving SNP data
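The slides show field-tagged queries typed into the Entrez web interface. The same kind of query can also be sent programmatically through NCBI's E-utilities ESearch service, which the talk does not cover. The sketch below only builds the request URL and does not send it; the "[Gene Name]" field tag is an assumption and should be checked against the current dbSNP search documentation.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (not mentioned on the slides).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_snp_query_url(gene_symbol, retmax=20):
    """Build (but do not send) an ESearch URL for SNP records tagged with a gene."""
    # "[Gene Name]" is an assumed field tag, used here for illustration.
    term = f"{gene_symbol}[Gene Name]"
    params = {"db": "snp", "term": term, "retmax": retmax, "retmode": "json"}
    return ESEARCH_URL + "?" + urlencode(params)


print(build_snp_query_url("PCSK9"))
```

Raising `retmax` is one way to pull identifiers back in batches rather than one SNP at a time.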

Summary - Finding SNPs: Databases and Extraction

Reviewing candidate genes using views and resources in
- SeattleSNPs

Integration of dense, gene-centric SNP maps with genomic HapMap SNPs
- GVS

HapMap viewer

NCBI databases through Entrez portal
- Entrez Gene, dbSNP, Entrez SNP
- many ways to retrieve and format data

Genome Variation Server: GVS

GWAS Asthma
Moffatt et al, Nature 448: 470-473, 2007

New Variation to Consider - Structural Variation

Types of Structural Variants

Insertions/Deletions, Inversions, Duplications, Translocations

Size:
Large-scale (>100 kb)
Intermediate-scale (500 bp–100 kb)
Fine-scale (1–500 bp)

More than 10% of the genome sequence

Nature 447: 161-165, 2007
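The three size classes on this slide amount to a simple lookup by variant length. A minimal sketch; the slide's ranges share the 500 bp boundary, so assigning exactly 500 bp to fine-scale and exactly 100 kb to intermediate-scale is an assumption made here:

```python
def classify_structural_variant(size_bp):
    """Map a variant length in base pairs to the slide's size classes."""
    if size_bp < 1:
        raise ValueError("size must be at least 1 bp")
    if size_bp <= 500:
        return "fine-scale"            # 1-500 bp
    if size_bp <= 100_000:
        return "intermediate-scale"    # 500 bp-100 kb
    return "large-scale"               # >100 kb


for size in (3, 500, 5_000, 100_000, 2_000_000):
    print(f"{size:>9} bp -> {classify_structural_variant(size)}")
```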

Detection of Outliers of the Distribution

X-linked SNP
Unknown SNP

Genetic Strategy - New Insights

[Figure: effect size (WEAK to STRONG) plotted against allele frequency (LOW to HIGH), with regions labeled LINKAGE, ASSOCIATION and ??]

Ardlie, Kruglyak & Seielstad (2002) Nat. Rev. Genet. 3: 299-309

Common Disease - Many Rare Variants

High Density Lipoprotein (HDL)

Sequencing Known Candidate Genes for Functional Variation From Individuals at the Tails of the Trait Distribution

[Figure: distribution of individuals from Low HDL to High HDL]

ABCA1 and HDL-C

• Observed excess of rare, nonsynonymous variants in low HDL-C samples at ABCA1

• Demonstrated functional relevance in cell culture

–Cohen et al, Science 305, 869-872, 2004

Many examples emerging

Common Disease Rare Variants

Personalized Human Genome Sequencing

Solexa - an example